Abstract

Alternating minimization represents a widely applicable and empirically successful approach 
for finding low-rank matrices that best fit the given data. For example, for the problem of low- 
rank matrix completion, this method is believed to be one of the most accurate and efficient, 
and formed a major component of the winning entry in the Netflix Challenge [15, 16]. 

In the alternating minimization approach, the low-rank target matrix is written in a bi-linear 
form, i.e. $X = UV^\dagger$; the algorithm then alternates between finding the best U and the best
V. Typically, each alternating step in isolation is convex and tractable. However the overall 
problem becomes non-convex and there has been almost no theoretical understanding of when 
this approach yields a good result. 

In this paper we present the first theoretical analysis of the performance of alternating minimization
for matrix completion, and the related problem of matrix sensing. For both these problems,
celebrated recent results have shown that they become well-posed and tractable once certain
(now standard) conditions are imposed on the problem. We show that alternating minimization
also succeeds under similar conditions. Moreover, compared to existing results, our paper shows 
that alternating minimization guarantees faster (in particular, geometric) convergence to the 
true matrix, while allowing a simpler analysis. 

1 Introduction 

Finding a low-rank matrix to fit / approximate observations is a fundamental task in data analysis. 
In a slew of applications, a popular empirical approach has been to represent the target rank-$k$
matrix $X \in \mathbb{R}^{m\times n}$ in a bi-linear form $X = UV^\dagger$, where $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$. Typically, this
is done for two reasons: 

(a) Size and computation: If the rank k of the target matrix (to be estimated) is much smaller 
than m, n, then U, V are significantly smaller than X and hence are more efficient to optimize for. 
This is crucial for several practical applications, e.g., recommender systems where one routinely 
encounters matrices with billions of entries. 

(b) Modeling: In several applications, one would like to impose extra constraints on the target 
matrix, besides just low rank. Oftentimes, these constraints might be easier and more natural to 
impose on factors U, V. For example, in Sparse PCA [22], one looks for a low-rank X that is the
product of sparse U and V. 

Due to the above two reasons, in several applications, the target matrix X is parameterized by
$X = UV^\dagger$, for example in clustering [14] and sparse PCA [22].

Using the bi-linear parametrization of the target matrix X, the task of estimating X now reduces 
to finding U and V that, for example, minimize an error metric. The resulting problem is typically 
non-convex due to bi-linearity. Correspondingly, a popular approach has been to use alternating 
minimization: iteratively keep one of U, V fixed and optimize over the other, then switch and 
repeat, see e.g. [16]. While the overall problem is non-convex, each sub-problem is typically convex 
and can be solved efficiently. 

Despite wide usage of bi-linear representation and alternating minimization, there has been to 
date almost no theoretical understanding of when such a formulation works. Motivated by this 
disconnect between theory and practice in the estimation of low-rank matrices, in this paper we 
provide the first guarantees for performance of alternating minimization, for two low-rank 
matrix recovery problems: matrix completion, and matrix sensing. 

Matrix completion involves completing a low-rank matrix, by observing only a few of its ele- 
ments. Its recent popularity, and primary motivation, comes from recommendation systems [16], 
where the task is to complete a user-item ratings matrix using only a small number of ratings. As 
elaborated in Section 2, alternating minimization becomes particularly appealing for this problem 
as it provides a fast, distributed algorithm that can exploit both sparsity of ratings as well as the 
low-rank bi-linear parametrization of X. 

Matrix sensing refers to the problem of recovering a low-rank matrix $M \in \mathbb{R}^{m\times n}$ from affine
equations. That is, given $d$ linear measurements $b_i = \mathrm{tr}(A_i^\dagger M)$ and measurement matrices $A_i$,
the goal is to recover back $M$. This problem is particularly interesting in the case of $d \ll mn$
and was first studied in [19] and subsequently in [10, 17]. In fact, matrix completion is a special
case of this problem, where each observed entry in the matrix completion problem represents one
single-element measurement matrix $A_i$.

Without any extra conditions, both matrix sensing and matrix completion are ill-posed problems,
with potentially multiple low-rank solutions, and are in general NP-hard [18]. Current work
on these problems thus imposes some extra conditions, which make the problems both well defined
and amenable to solution via the respective proposed algorithms [19, 3]; see Section 3 for more
details. In this paper, we show that under similar conditions to the ones used by the exist- 
ing methods, alternating minimization also guarantees recovery of the true matrix; we also show 
that it requires only a small number of computationally cheap iterations and hence, as observed 
empirically, is computationally much more efficient than the existing methods. 

Notations: We represent a matrix by a capital letter (e.g. $M$) and a vector by a small letter ($u$).
$u_i$ represents the $i$-th element of $u$ and $U_{ij}$ denotes the $(i,j)$-th entry of $U$. $U_i$ represents the $i$-th column of
$U$ and $U^{(i)}$ represents the $i$-th row of $U$. $A^\dagger$ denotes the matrix transpose of $A$. $u = \mathrm{vec}(U)$ represents
vectorized $U$, i.e., $u = [U_1^\dagger\ U_2^\dagger\ \cdots\ U_k^\dagger]^\dagger$. $\|u\|_p$ denotes the $L_p$ norm of $u$, i.e., $\|u\|_p = (\sum_i |u_i|^p)^{1/p}$.
By default, $\|u\|$ denotes the $L_2$ norm of $u$. $\|A\|_F$ denotes the Frobenius norm of $A$, i.e., $\|\mathrm{vec}(A)\|_2$.
$\|A\|_2 = \max_{x,\ \|x\|_2 = 1} \|Ax\|_2$ denotes the spectral norm of $A$. $\mathrm{tr}(A)$ denotes the trace (sum of diagonal
elements) of a square matrix $A$. Typically, $\hat U$, $\hat V$ represent factor matrices (i.e., $\hat U \in \mathbb{R}^{m\times k}$ and
$\hat V \in \mathbb{R}^{n\times k}$) and $U$, $V$ represent their orthonormal bases.
2 Our Results



In this section, we will first define the matrix sensing problem, and present our results for it. 
Subsequently, we will do the same for matrix completion. The matrix sensing setting - i.e. recovery 
of any low-rank matrix from linear measurements that satisfy matrix RIP - represents an easier 
analytical setting than matrix completion, but still captures several key properties of the problem 
that help us in developing an analysis for matrix completion. We note that for either problem, ours
are the first global optimality guarantees for alternating minimization based algorithms.

Matrix Sensing via Alternating Minimization 

Given $d$ linear measurements $b_i = \langle M, A_i\rangle = \mathrm{tr}(A_i^\dagger M)$, $1 \le i \le d$, of an unknown rank-$k$ matrix
$M \in \mathbb{R}^{m\times n}$ and the sensing matrices $A_i$, $1 \le i \le d$, the goal in matrix sensing is to recover back $M$.
In the following we collate these coefficients, so that $b \in \mathbb{R}^d$ is the vector of $b_i$'s, and $\mathcal{A}(\cdot): \mathbb{R}^{m\times n} \to \mathbb{R}^d$
is the corresponding linear map, with $b = \mathcal{A}(M)$. With this notation, the Low-Rank Matrix Sensing
problem is:

Find $X \in \mathbb{R}^{m\times n}$ s.t. $\mathcal{A}(X) = b$, $\mathrm{rank}(X) \le k$. (LRMS)

As in the existing work [19] on this problem, we are interested in the under-determined case, 
where d < mn. Note that this problem is a strict generalization of the popular compressed sensing 
problem [4]; compressed sensing represents the case when M is restricted to be a diagonal matrix. 

For matrix sensing, the alternating minimization approach involves representing $X$ as a product
of two matrices $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$, i.e., $X = UV^\dagger$. If $k$ is (much) smaller than $m, n$,
these matrices will be (much) smaller than $X$. With this bi-linear representation, alternating
minimization can be viewed as an approximate way to solve the following non-convex optimization
problem:

$$\min_{U \in \mathbb{R}^{m\times k},\ V \in \mathbb{R}^{n\times k}} \|\mathcal{A}(UV^\dagger) - b\|_2^2$$

As mentioned earlier, the alternating minimization algorithm for matrix sensing now alternately solves
for $U$ and $V$ while fixing the other factor. See Algorithm 1 for pseudo-code of the AltMinSense
algorithm that we analyze.

We note two key properties of AltMinSense: a) Each minimization, over $U$ with $V$ fixed and
vice versa, is a simple least-squares problem, which can be solved in time $O(dn^2k^2 + n^3k^3)$ (footnote 1); b) We
initialize $\hat U^0$ to be the top-$k$ left singular vectors of $\sum_i A_i b_i$ (step 2 of Algorithm 1). As we will see
later in Section 4, this provides a good initialization point for the sensing problem, which is crucial:
if the first iterate $\hat U^0$ is orthogonal, or almost orthogonal, to the true $U^*$ subspace, AltMinSense
may never converge to the true space (this is easy to see in the simplest case, when the map is
identity, i.e. $\mathcal{A}(X) = X$, in which case AltMinSense just becomes the power method).

In general, since $d < mn$, problem (LRMS) is not well posed as there can be multiple rank-$k$
solutions that satisfy $\mathcal{A}(X) = b$. However, inspired by a similar condition in compressed sensing [4],
Recht et al. [19] showed that if the linear map $\mathcal{A}$ satisfies a (matrix) restricted isometry property
(RIP), then a trace-norm based convex relaxation of (LRMS) leads to exact recovery. This property
is defined below.

^1 Throughout this paper, we assume $m \le n$.
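Algorithm 1 AltMinSense: Alternating minimization for matrix sensing
1: Input $b$, $\mathcal{A}$
2: Initialize $\hat U^0$ to be the top-$k$ left singular vectors of $\sum_i A_i b_i$
3: for $t = 0, \cdots, T-1$ do
4:   $\hat V^{t+1} \leftarrow \arg\min_{V \in \mathbb{R}^{n\times k}} \|\mathcal{A}(\hat U^t V^\dagger) - b\|_2^2$
5:   $\hat U^{t+1} \leftarrow \arg\min_{U \in \mathbb{R}^{m\times k}} \|\mathcal{A}(U(\hat V^{t+1})^\dagger) - b\|_2^2$
6: end for
7: Return $X = \hat U^T(\hat V^T)^\dagger$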
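
The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `alt_min_sense`, the storage of the sensing matrices as a `(d, m, n)` array, the Gaussian measurement ensemble, and the problem sizes are all assumptions chosen for the demo.

```python
import numpy as np

def alt_min_sense(A, b, k, T):
    """Sketch of Algorithm 1. A has shape (d, m, n); b[i] = <A[i], M> (assumed layout)."""
    d, m, n = A.shape
    # Step 2: top-k left singular vectors of sum_i b_i A_i.
    U = np.linalg.svd(np.tensordot(b, A, axes=1))[0][:, :k]
    for _ in range(T):
        # Step 4: with U fixed, <A_i, U V^T> = <A_i^T U, V> is linear in V.
        G = np.einsum('dmn,mk->dnk', A, U).reshape(d, n * k)
        V = np.linalg.lstsq(G, b, rcond=None)[0].reshape(n, k)
        # Step 5: with V fixed, <A_i, U V^T> = <A_i V, U> is linear in U.
        H = np.einsum('dmn,nk->dmk', A, V).reshape(d, m * k)
        U = np.linalg.lstsq(H, b, rcond=None)[0].reshape(m, k)
    return U, V

# Demo on a synthetic rank-2 target with d < m*n (under-determined case).
rng = np.random.default_rng(0)
m, n, k, d = 20, 20, 2, 300
Qu = np.linalg.qr(rng.standard_normal((m, k)))[0]   # orthonormal left factor
Qv = np.linalg.qr(rng.standard_normal((n, k)))[0]   # orthonormal right factor
M = Qu @ np.diag([2.0, 1.0]) @ Qv.T                 # singular values 2 and 1
A = rng.standard_normal((d, m, n)) / np.sqrt(d)     # i.i.d. Gaussian sensing matrices
b = np.einsum('dmn,mn->d', A, M)
U, V = alt_min_sense(A, b, k, T=100)
rel_err = np.linalg.norm(U @ V.T - M) / np.linalg.norm(M)
```

Each half-step is an ordinary least-squares solve in $nk$ (or $mk$) unknowns, which is the source of the per-iteration cost discussed in the text.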



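Definition 2.1. [19] A linear operator $\mathcal{A}(\cdot): \mathbb{R}^{m\times n} \to \mathbb{R}^d$ is said to satisfy $k$-RIP, with $\delta_k$ RIP
constant, if for all $X \in \mathbb{R}^{m\times n}$ s.t. $\mathrm{rank}(X) \le k$, the following holds:

$$(1 - \delta_k)\|X\|_F^2 \le \|\mathcal{A}(X)\|_2^2 \le (1 + \delta_k)\|X\|_F^2. \quad (1)$$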

Several random matrix ensembles with sufficiently many measurements ($d$) satisfy matrix RIP
[19]. For example, if $d = \Omega(\frac{1}{\delta_k^2} kn\log n)$ and each entry of $A_i$ is sampled i.i.d. from a 0-mean
sub-Gaussian distribution then $k$-RIP is satisfied with RIP constant $\delta_k$.
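
As a rough numerical illustration of the RIP inequality (1), the sketch below draws a Gaussian measurement ensemble and checks how well $\|\mathcal{A}(X)\|_2^2$ tracks $\|X\|_F^2$ on randomly drawn low-rank matrices. The sizes, the $N(0, 1/d)$ scaling, and the number of trials are assumptions for the demo. Sampling random $X$ only probes the inequality; it does not certify the RIP constant, which is a supremum over all rank-$k$ matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, d = 10, 10, 2, 1000
A = rng.standard_normal((d, m, n)) / np.sqrt(d)    # entries N(0, 1/d)

ratios = []
for _ in range(50):
    X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # rank <= k
    AX = np.einsum('dmn,mn->d', A, X)                              # A(X)_i = <A_i, X>
    ratios.append(np.sum(AX ** 2) / np.sum(X ** 2))                # ||A(X)||^2 / ||X||_F^2
ratios = np.array(ratios)
```

The observed ratios cluster around 1, and the spread shrinks as $d$ grows.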
We now present our main result for AltMinSense. 

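Theorem 2.2. Let $M = U^*\Sigma^*V^{*\dagger}$ be a rank-$k$ matrix with non-zero singular values $\sigma_1^* \ge \cdots \ge \sigma_k^*$.
Also, let the linear measurement operator $\mathcal{A}(\cdot): \mathbb{R}^{m\times n} \to \mathbb{R}^d$ satisfy $2k$-RIP with RIP constant
$\delta_{2k} < \frac{(\sigma_k^*)^2}{(\sigma_1^*)^2}\frac{1}{100k}$. Then, in the AltMinSense algorithm (Algorithm 1), for all $T > 2\log(\|M\|_F/\epsilon)$,
the iterates $\hat U^T$ and $\hat V^T$ satisfy:

$$\|M - \hat U^T(\hat V^T)^\dagger\|_F \le \epsilon.$$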

The above theorem establishes geometric convergence (in $O(\log(1/\epsilon))$ steps) of AltMinSense to
the optimal solution of (LRMS) under standard RIP assumptions. This is in contrast to existing
iterative methods for trace-norm minimization, all of which require at least $O(1/\sqrt{\epsilon})$ steps; interior
point methods for trace-norm minimization converge to the optimum in $O(\log(1/\epsilon))$ steps but
require storage of the full $m \times n$ matrix and require $O(n^5)$ time per step, which makes them infeasible
for even moderate sized problems.

Recently, several projected gradient based methods have been developed for matrix sensing
[10, 17] that also guarantee convergence to the optimum in $O(\log(1/\epsilon))$ steps. But each iteration in
these algorithms requires computation of the top $k$ singular components of an $m \times n$ matrix, which
is typically significantly slower than solving a least squares problem (as required by each iteration
of AltMinSense).

Stagewise AltMinSense Algorithm: A drawback of our analysis for AltMinSense is the dependence
of $\delta_{2k}$ on the condition number ($\kappa = \sigma_1^*/\sigma_k^*$) of $M$, which implies that the number of measurements
$d$ required by AltMinSense grows quadratically with $\kappa$. We address this issue by using
a stagewise version of AltMinSense (Algorithm 3) for which we are able to obtain a near optimal
measurement requirement.

The key idea behind our stagewise algorithm is that if one of the singular vectors of M is very 
dominant, then we can treat the underlying matrix as a rank-1 matrix plus noise and approximately 
recover the top singular vector. Once we remove this singular vector from the measurements, we 
will have a relatively well-conditioned problem. Hence, at each stage of Algorithm 3, we seek to 
remove the remaining most dominant singular vector of M. See Section 6 for more details; here we
state the corresponding theorem regarding the performance of Stage-AltMin.

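Theorem 2.3. Let $M = U^*\Sigma^*V^{*\dagger}$ be a rank-$k$ incoherent matrix with non-zero singular values
$\sigma_1^* \ge \cdots \ge \sigma_k^*$. Also, let $\mathcal{A}(\cdot): \mathbb{R}^{m\times n} \to \mathbb{R}^d$ be a linear measurement operator that satisfies
$2k$-RIP with RIP constant $\delta_{2k} < \frac{1}{3200k^2}$. Suppose Stage-AltMin (Algorithm 3) is supplied inputs
$\mathcal{A}$, $b = \mathcal{A}(M)$. Then, the $i$-th stage iterates $\hat U^T_{1:i}$, $\hat V^T_{1:i}$ satisfy:

$$\|M - \hat U^T_{1:i}(\hat V^T_{1:i})^\dagger\|_F^2 \le \max(\epsilon,\ 16k(\sigma_{i+1}^*)^2),$$

where $T = \Omega(\log(\|M\|_F^2/\epsilon))$. That is, the $T$-th step iterates of the $k$-th stage satisfy:
$\|M - \hat U^T_{1:k}(\hat V^T_{1:k})^\dagger\|_F^2 \le \epsilon$.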

The above theorem guarantees exact recovery using $O(k^4 n\log n)$ measurements, which is only
$O(k^3)$ worse than the information theoretic lower bound. We also note that for simplicity of
analysis, we did not optimize the constant factors in $\delta_{2k}$.



Matrix Completion via Alternating Minimization 

The matrix completion problem is the following: there is an unknown rank-$k$ matrix $M \in \mathbb{R}^{m\times n}$,
of which we know a set $\Omega \subset [m] \times [n]$ of elements; that is, we know the values of elements $M_{ij}$, for
$(i,j) \in \Omega$. The task is to recover $M$. Formally, the Low-Rank Matrix Completion problem is:

Find rank-$k$ matrix $X$ s.t. $P_\Omega(X) = P_\Omega(M)$, (LRMC)

where for any matrix $S$ and a set of elements $\Omega \subset [m] \times [n]$ the matrix $P_\Omega(S) \in \mathbb{R}^{m\times n}$ is as defined
below:

$$P_\Omega(S)_{ij} = \begin{cases} S_{ij} & \text{if } (i,j) \in \Omega, \\ 0 & \text{otherwise.} \end{cases} \quad (2)$$

We are again interested in the under-determined case; in fact, for a fixed rank $k$, as few as $O(n\log n)$
elements may be observed. This problem is a special case of matrix sensing, with the measurement
matrices $A_i = e_j e_l^\dagger$ being non-zero only in single elements; however, such matrices do not satisfy
matrix RIP conditions like (1). For example, consider a low-rank $M = e_1 e_1^\dagger$ for which a uniformly
random $\Omega$ of size $O(n\log n)$ will most likely miss the non-zero entry of $M$.
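
The sampling operator of Equation (2) and the spiky counterexample can be illustrated with a short sketch. The helper name `P_Omega`, the boolean-mask representation of the observed set, the size $n = 200$ and the sampling rate $\log(n)/n$ are assumptions made for the demo.

```python
import numpy as np

def P_Omega(S, mask):
    """Keep entries of S where mask is True, zero elsewhere (Equation (2))."""
    return np.where(mask, S, 0.0)

rng = np.random.default_rng(2)
n = 200
M = np.zeros((n, n))
M[0, 0] = 1.0                          # M = e_1 e_1^T: rank one, but maximally "spiky"
p = np.log(n) / n                      # about n log n observed entries in expectation
mask = rng.random((n, n)) < p          # observed set, sampled uniformly at random
observed = P_Omega(M, mask)
```

Unless the single entry $(1,1)$ happens to be sampled, which occurs with probability only $p$, the observations are identically zero and carry no information about $M$.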

Nevertheless, like matrix sensing, matrix completion has been shown to be possible once additional
conditions are applied to the low-rank matrix $M$ and the observation set $\Omega$. Starting with the
first work [3], the typical assumption has been to have $\Omega$ generated uniformly at random, and $M$
to satisfy a particular incoherence property that, loosely speaking, makes it very far from a sparse 
matrix. In this paper, we show that once such assumptions are made, alternating minimization 
also succeeds. We now restate, and subsequently use, this incoherence definition. 

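Definition 2.4. [3] A matrix $M \in \mathbb{R}^{m\times n}$ is incoherent with parameter $\mu$ if:

$$\|u^{(i)}\|_2 \le \frac{\mu\sqrt{k}}{\sqrt{m}}\ \ \forall\, i \in [m], \qquad \|v^{(j)}\|_2 \le \frac{\mu\sqrt{k}}{\sqrt{n}}\ \ \forall\, j \in [n], \quad (3)$$

where $M = U\Sigma V^\dagger$ is the SVD of $M$ and $u^{(i)}$, $v^{(j)}$ denote the $i$-th row of $U$ and the $j$-th row of $V$
respectively.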
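
To make the incoherence definition concrete, the sketch below computes the smallest $\mu$ for which the row-norm bounds of (3) hold. The function name `incoherence` and the two test matrices are illustrative assumptions.

```python
import numpy as np

def incoherence(M, k):
    """Smallest mu satisfying the row-norm bounds (3) for the rank-k SVD factors of M."""
    m, n = M.shape
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :k], Vt[:k].T                       # top-k singular vectors
    mu_u = np.linalg.norm(U, axis=1).max() * np.sqrt(m / k)
    mu_v = np.linalg.norm(V, axis=1).max() * np.sqrt(n / k)
    return max(mu_u, mu_v)

n = 16
spiky = np.zeros((n, n))
spiky[0, 0] = 1.0                                   # e_1 e_1^T: all mass in one entry
flat = np.ones((n, n))                              # rank one, mass spread evenly
mu_spiky, mu_flat = incoherence(spiky, 1), incoherence(flat, 1)
```

The all-ones matrix attains the smallest possible value $\mu = 1$, while $e_1 e_1^\dagger$ gives $\mu = \sqrt{n}$, the largest possible for rank one.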
The alternating minimization algorithm can be viewed as an approximate way to solve the
following non-convex problem:

$$\min_{U \in \mathbb{R}^{m\times k},\ V \in \mathbb{R}^{n\times k}} \|P_\Omega(UV^\dagger) - P_\Omega(M)\|_F^2$$

Similar to AltMinSense, the altmin procedure proceeds by alternately solving for $U$ and $V$.
As noted earlier, this approach has been popular in practice and has seen several variants and
extensions [21, 16, 15, 7]. However, for ease of analysis, our algorithm further
modifies the standard alternating minimization method. In particular, we introduce partitioning
of the observed set $\Omega$, so that we use different partitions of $\Omega$ in each iteration. See Algorithm 2
for pseudo-code of our variant of the alternating minimization approach.

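Algorithm 2 AltMinComplete: Alternating minimization for matrix completion
1: Input: observed set $\Omega$, values $P_\Omega(M)$
2: Partition $\Omega$ into $2T + 1$ subsets $\Omega_0, \cdots, \Omega_{2T}$ with each element of $\Omega$ belonging to one of the $\Omega_t$
   with equal probability (sampling with replacement)
3: $\hat U^0 = \mathrm{SVD}(\frac{1}{p}P_{\Omega_0}(M), k)$ i.e., top-$k$ left singular vectors of $\frac{1}{p}P_{\Omega_0}(M)$
4: Clipping step: Set all elements of $\hat U^0$ that have magnitude greater than $\frac{2\mu\sqrt{k}}{\sqrt{n}}$ to zero and
   orthonormalize the columns of $\hat U^0$
5: for $t = 0, \cdots, T-1$ do
6:   $\hat V^{t+1} \leftarrow \arg\min_{V \in \mathbb{R}^{n\times k}} \|P_{\Omega_{t+1}}(\hat U^t V^\dagger - M)\|_F^2$
7:   $\hat U^{t+1} \leftarrow \arg\min_{U \in \mathbb{R}^{m\times k}} \|P_{\Omega_{T+t+1}}(U(\hat V^{t+1})^\dagger - M)\|_F^2$
8: end for
9: Return $X = \hat U^T(\hat V^T)^\dagger$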
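
The alternating least-squares core of Algorithm 2 can be sketched as below. This is a simplified variant, not the analyzed algorithm: it reuses the full observed set in every step and omits the partitioning (step 2) and the clipping (step 4), which the paper introduces for ease of analysis. The function name `alt_min_complete`, the boolean-mask input, and the demo sizes are assumptions.

```python
import numpy as np

def alt_min_complete(M_obs, mask, k, T):
    """Simplified alternating minimization for completion (no partitioning or clipping)."""
    m, n = M_obs.shape
    p = mask.mean()                                   # empirical sampling rate
    # SVD initialization from the rescaled observations (1/p) P_Omega(M).
    U = np.linalg.svd(M_obs / p, full_matrices=False)[0][:, :k]
    V = np.zeros((n, k))
    for _ in range(T):
        for j in range(n):                            # row j of V: least squares over column j
            rows = mask[:, j]
            V[j] = np.linalg.lstsq(U[rows], M_obs[rows, j], rcond=None)[0]
        for i in range(m):                            # row i of U: least squares over row i
            cols = mask[i]
            U[i] = np.linalg.lstsq(V[cols], M_obs[i, cols], rcond=None)[0]
    return U, V

rng = np.random.default_rng(5)
m, n, k = 40, 40, 2
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # random rank-2 target
mask = rng.random((m, n)) < 0.5                                  # observe about half the entries
U, V = alt_min_complete(np.where(mask, M, 0.0), mask, k, T=50)
rel_err = np.linalg.norm(U @ V.T - M) / np.linalg.norm(M)
```

Because the objective separates over rows once the other factor is fixed, each half-step is a collection of small independent $k$-variable least-squares problems, which is what makes the method easy to distribute.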



We now present our main result for (LRMC): 

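Theorem 2.5. Let $M = U^*\Sigma^*V^{*\dagger} \in \mathbb{R}^{m\times n}$ ($n \ge m$) be a rank-$k$ incoherent matrix, i.e., both $U^*$
and $V^*$ are $\mu$-incoherent (see Definition 2.4). Also, let each entry of $M$ be observed uniformly and
independently with probability

$$p > C\,\frac{(\sigma_1^*/\sigma_k^*)^2\,\mu^2 k^{2.5}\log n\,\log(k\|M\|_F/\epsilon)}{m\,\delta_{2k}^2},$$

where $\delta_{2k} \le \frac{\sigma_k^*}{12k\sigma_1^*}$ and $C > 0$ is a global constant. Then w.h.p. for $T = C'\log(\|M\|_F/\epsilon)$, the outputs $\hat U^T$
and $\hat V^T$ of Algorithm 2, with input $(\Omega, P_\Omega(M))$ (see Equation (2)), satisfy:

$$\|M - \hat U^T(\hat V^T)^\dagger\|_F \le \epsilon.$$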



The above theorem implies that by observing $|\Omega| = O\big((\sigma_1^*/\sigma_k^*)^4 k^{4.5} n\log n\log(k\|M\|_F/\epsilon)\big)$ random
entries of an incoherent $M$, AltMinComplete can recover $M$ in $O(\log(1/\epsilon))$ steps. In terms of
sample complexity, our results show alternating minimization may require a bigger $\Omega$ than
convex optimization, as our result has $|\Omega|$ depend on the condition number, required accuracy
($\epsilon$) and worse dependence on $k$ than known bounds. In contrast, trace-norm minimization based
methods require $O(kn\log n)$ samples only.
Empirically however, this is not seen to be the case [10] and we leave further tightening of the
sample complexity bounds for matrix completion as an open problem.

In terms of time complexity, we show that AltMinComplete needs time $O(|\Omega|k^2\log(1/\epsilon))$. This
is in contrast to popular trace-norm minimization based methods that need $O(1/\sqrt{\epsilon})$ steps [1]
and total time complexity of $O(|\Omega|n/\sqrt{\epsilon})$; note that the latter can be potentially quadratic in $n$.
Furthermore, each step of such methods requires computation of the SVD of an $m \times n$ matrix. As
mentioned earlier, interior point methods for trace-norm minimization also converge in $O(\log(1/\epsilon))$
steps but each iteration requires $O(n^5)$ steps and needs storage of the entire $m \times n$ matrix $X$.

3 Related Work 

Alternating Minimization: Alternating minimization and its variants have been applied to 
several low-rank matrix estimation problems. For example, clustering [14], sparse PCA [22], non- 
negative matrix factorization [13], signed network prediction [9] etc. There are three main reasons 
for such wide applicability of this approach: a) low-memory footprint and fast iterations, b) flexible 
modeling, c) amenable to parallelization. However, despite such empirical success, this approach 
has largely been used as a heuristic and has had no theoretical analysis other than the guarantees of 
convergence to the local minima [20]. Ours is the first analysis of this approach for two practically 
important problems: a) matrix completion, b) matrix sensing. 

Matrix Completion: This is the problem of completing a low-rank matrix from a few sampled
entries. Candes and Recht [3] provided the first results on this problem, showing that under the
random sampling and incoherence conditions (detailed above), $O(kn^{1.2}\log n)$ samples allow for
recovery via convex trace-norm minimization; this was improved to $O(kn\log n)$ in [5]. For large
matrices, this approach is not very attractive due to the need to store and update the entire matrix,
and because iterative methods for trace norm minimization require $O(1/\sqrt{\epsilon})$ steps to achieve additive
error of $\epsilon$. Moreover, each such step needs to compute an SVD.

Another approach, in [12], involved taking a single SVD, followed by gradient descent on a
Grassmannian manifold. However, (a) this is more expensive than alternating minimization as
it needs to compute the gradient over the Grassmannian manifold, which in general is a computationally
intensive step, and (b) the analysis of the algorithm only guarantees asymptotic convergence, and 
in the worst case might take exponential time in the problem size. 

Recently, several other matrix completion type of problems have been studied in the literature. 
For example, robust PCA [6, 2], spectral clustering [11] etc. Here again, under additional assump- 
tions, convex relaxation based methods have rigorous analysis but alternating minimization based 
algorithms continue to be algorithms of choice in practice. 

Matrix Sensing: The general problem of matrix sensing was first proposed by [19]. They established
recovery via trace norm minimization, assuming the sensing operator satisfies "restricted
isometry" conditions. Subsequently, several other methods [10, 17] were proposed for this problem
that also recover the underlying matrix with an optimal number of measurements and can give an
$\epsilon$-additive approximation in time $O(\log(1/\epsilon))$. But, similar to matrix completion, most of these
methods require computing the SVD of a large matrix at each step and hence have poor scalability to
large problems.

We show that AltMinSense and AltMinComplete provide more scalable algorithms for their
respective problems. We demonstrate that these algorithms have geometric convergence to the
optima, while each iteration is relatively cheap. For this, we assume conditions similar to those
required by existing algorithms, albeit with one drawback: the number of samples required by our
analysis depends on the condition number of the underlying matrix $M$. For the matrix sensing
problem, we remove this requirement by using a stagewise algorithm; we leave similar analysis for
matrix completion as an open problem.



4 Matrix Sensing 

In this section, we study the matrix sensing problem (LRMS) and prove that if the measurement
operator satisfies RIP then AltMinSense (Algorithm 1) recovers the underlying low-rank matrix
exactly (see Theorem 2.2).

At a high level, we prove Theorem 2.2 by showing that the "distance" between subspaces
spanned by $\hat V^t$ (iterate at time $t$) and $V^*$ decreases exponentially with $t$. This is done based on
the observation that once the (standard) matrix RIP condition (Definition 2.1) holds, alternating
minimization can be viewed, and analyzed, as a perturbed version of the power method.
This is easiest to see for the rank-1 case below; we detail this proof, and then the more general
rank-$k$ case.

In this paper, we use the following definition of distance between subspaces: 

Definition 4.1. [8] Given two matrices $\hat U, \hat W \in \mathbb{R}^{m\times k}$, the (principal angle) distance between the
subspaces spanned by the columns of $\hat U$ and $\hat W$ is given by:

$$\mathrm{dist}(\hat U, \hat W) \stackrel{\mathrm{def}}{=} \|U_\perp^\dagger W\|_2 = \|W_\perp^\dagger U\|_2,$$

where $U$ and $W$ are orthonormal bases of the spaces $\mathrm{Span}(\hat U)$ and $\mathrm{Span}(\hat W)$, respectively. Similarly,
$U_\perp$ and $W_\perp$ are any orthonormal bases of the perpendicular spaces $\mathrm{Span}(U)^\perp$ and $\mathrm{Span}(W)^\perp$,
respectively.

Note: (a) The distance depends only on the spaces spanned by the columns of $\hat U$, $\hat W$, (b) if the
ranks of $\hat U$ and $\hat W$ (i.e. the dimensions of their spans) are not equal, then $\mathrm{dist}(\hat U, \hat W) = 1$, and
(c) $\mathrm{dist}(\hat U, \hat W) = 0$ if and only if they span the same subspace of $\mathbb{R}^m$.
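
The principal angle distance can be computed directly from its definition, using the identity $\|U_\perp^\dagger W\|_2 = \|(I - UU^\dagger)W\|_2$ so that no basis of the orthogonal complement is needed. The sketch below assumes both inputs have full column rank and the same number of columns; the function name `subspace_dist` is illustrative.

```python
import numpy as np

def subspace_dist(U_hat, W_hat):
    """Principal angle distance between the column spans of U_hat and W_hat.

    Assumes full column rank and equal column counts.
    """
    U = np.linalg.qr(U_hat)[0]                 # orthonormal basis of Span(U_hat)
    W = np.linalg.qr(W_hat)[0]                 # orthonormal basis of Span(W_hat)
    # ||U_perp^T W||_2 equals the spectral norm of the residual of W after projecting onto U.
    return np.linalg.norm(W - U @ (U.T @ W), 2)

E = np.eye(4)
base = E[:, :2]
same_span = base @ np.array([[2.0, 1.0], [0.0, 3.0]])   # same span, different basis
orthogonal = E[:, 2:]                                   # orthogonal complement
theta = 0.3
u = np.array([[1.0], [0.0]])
w = np.array([[np.cos(theta)], [np.sin(theta)]])        # unit vector at angle theta to u
```

For two vectors the distance reduces to the sine of the angle between them, which is the form used in the rank-1 analysis.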

We now present a theorem that bounds the distance between the subspaces spanned by $\hat V^t$ and
$V^*$ and show that it decreases exponentially with $t$.

Theorem 4.2. Let $b = \mathcal{A}(M)$ where $M$ and $\mathcal{A}$ satisfy assumptions given in Theorem 2.2. Then,
the $(t+1)$-th iterates $\hat U^{t+1}$, $\hat V^{t+1}$ of AltMinSense satisfy:

$$\mathrm{dist}(\hat V^{t+1}, V^*) \le \tfrac{1}{4}\,\mathrm{dist}(\hat U^t, U^*),$$
$$\mathrm{dist}(\hat U^{t+1}, U^*) \le \tfrac{1}{4}\,\mathrm{dist}(\hat V^{t+1}, V^*),$$

where $\mathrm{dist}(U, W)$ denotes the principal angle based distance (see Definition 4.1).

Using Theorem 4.2, we are now ready to prove Theorem 2.2.
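Proof of Theorem 2.2. Assuming correctness of Theorem 4.2, Theorem 2.2 follows by using the
following set of inequalities:

$$\|M - \hat U^T(\hat V^T)^\dagger\|_F^2 \overset{\zeta_1}{\le} \frac{1}{1-\delta_{2k}}\|\mathcal{A}(M - \hat U^T(\hat V^T)^\dagger)\|_2^2$$
$$\overset{\zeta_2}{\le} \frac{1}{1-\delta_{2k}}\|\mathcal{A}(M(I - V^T(V^T)^\dagger))\|_2^2$$
$$\overset{\zeta_3}{\le} \frac{1+\delta_{2k}}{1-\delta_{2k}}\|U^*\Sigma^*(V^*)^\dagger(I - V^T(V^T)^\dagger)\|_F^2$$
$$\overset{\zeta_4}{\le} \frac{1+\delta_{2k}}{1-\delta_{2k}}\|M\|_F^2\,\mathrm{dist}^2(V^T, V^*) \overset{\zeta_5}{\le} \epsilon,$$

where $V^T$ is an orthonormal basis of $\hat V^T$, $\zeta_1$ and $\zeta_3$ follow by RIP, $\zeta_2$ holds as $\hat U^T$ is the least
squares solution, $\zeta_4$ follows from the definition of $\mathrm{dist}(\cdot, \cdot)$ and finally $\zeta_5$ follows from Theorem 4.2
and by setting $T$ appropriately. □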

To complete the proof of Theorem 2.2, we now need to prove Theorem 4.2. In the next section,
we illustrate the main ideas of the proof of Theorem 4.2 by applying it to a rank-1 matrix, i.e., when
$k = 1$. We then provide a proof of Theorem 4.2 for arbitrary $k$ in Section 4.2.

4.1 Rank-1 Case 

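In this section, we provide a proof of Theorem 4.2 for the special case of $k = 1$. That is, let
$M = u^*\sigma^*(v^*)^\dagger$ s.t. $u^* \in \mathbb{R}^m$, $\|u^*\|_2 = 1$ and $v^* \in \mathbb{R}^n$, $\|v^*\|_2 = 1$. Also note that when $\hat u$ and $\hat w$ are
vectors, $\mathrm{dist}^2(\hat u, \hat w) = 1 - (u^\dagger w)^2$, where $u = \hat u/\|\hat u\|_2$ and $w = \hat w/\|\hat w\|_2$.

Consider the $t$-th update step in the AltMinSense procedure. As

$$\hat v^{t+1} = \arg\min_{\hat v}\sum_{i=1}^d\big(\hat u^{t\dagger}A_i^\dagger\hat v - \sigma^* u^{*\dagger}A_i^\dagger v^*\big)^2,$$

setting the gradient of the above objective function to 0, we obtain:

$$\Big(\sum_{i=1}^d A_i u^t (u^t)^\dagger A_i^\dagger\Big)\|\hat u^t\|_2\,\hat v^{t+1} = \sigma^*\Big(\sum_{i=1}^d A_i u^t (u^*)^\dagger A_i^\dagger\Big)v^*,$$

where $u^t = \hat u^t/\|\hat u^t\|_2$. Now, let $B = \sum_{i=1}^d A_i u^t(u^t)^\dagger A_i^\dagger$ and $C = \sum_{i=1}^d A_i u^t (u^*)^\dagger A_i^\dagger$. Then,

$$\|\hat u^t\|_2\,\hat v^{t+1} = \sigma^* B^{-1}Cv^* = \underbrace{\langle u^*, u^t\rangle\sigma^*v^*}_{\text{Power Method}} - \underbrace{B^{-1}\big(\langle u^*,u^t\rangle B - C\big)\sigma^*v^*}_{\text{Error Term}}. \quad (4)$$

Note that the first term in the above expression is the power method iterate (i.e., $M^\dagger u^t$). The
second term is an error term and the goal is to show that it becomes smaller as $u^t$ gets closer to
$u^*$. Note that when $u^t = u^*$, the error term is 0 irrespective of the measurement operator $\mathcal{A}$.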
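
The decomposition in (4) can be checked numerically. The sketch below builds $B$ and $C$ for a random measurement ensemble, confirms that the least-squares update equals the power-method term minus the error term, and confirms that the error term vanishes when the iterate equals $u^*$. The sizes, the Gaussian ensemble, and the choice of $A_i \in \mathbb{R}^{n\times m}$ (so that $A_i u$ is an $n$-vector, matching the ordering used in $B$ and $C$) are assumptions for the demo; the iterate is taken to have unit norm.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, d, sigma = 6, 5, 40, 3.0
unit = lambda x: x / np.linalg.norm(x)
u_star, v_star = unit(rng.standard_normal(m)), unit(rng.standard_normal(n))
u = unit(u_star + 0.3 * rng.standard_normal(m))   # current unit-norm iterate u^t
A = rng.standard_normal((d, n, m))                # A_i in R^{n x m} (assumed orientation)

G = A @ u                                         # row i is A_i u^t
H = A @ u_star                                    # row i is A_i u*
b = sigma * (H @ v_star)                          # b_i = sigma * u*^T A_i^T v*
B, C = G.T @ G, G.T @ H                           # B = sum A_i u u^T A_i^T, C = sum A_i u u*^T A_i^T

v_ls = np.linalg.lstsq(G, b, rcond=None)[0]       # least-squares update for v
c = u_star @ u                                    # <u*, u^t>
power_term = c * sigma * v_star
error_term = sigma * np.linalg.solve(B, (c * B - C) @ v_star)
```

The update `v_ls` agrees with `power_term - error_term`, and repeating the computation with `u = u_star` makes $B = C$, so the error term is zero regardless of the measurement matrices.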
Below, we provide a precise bound on the error term: 

Lemma 4.3. Consider the error term defined in (4) and let $\mathcal{A}$ satisfy 2-RIP with constant $\delta_2$.
Then,

$$\|B^{-1}\big(\langle u^*,u^t\rangle B - C\big)v^*\| \le \frac{3\delta_2}{1-3\delta_2}\sqrt{1 - \langle u^t,u^*\rangle^2}.$$

See Appendix B.1 for a detailed proof of the above lemma.
Using the above lemma, we now finish the proof of Theorem 4.2:



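Proof of Rank-1 case of Theorem 4.2. Let $v^{t+1} = \hat v^{t+1}/\|\hat v^{t+1}\|_2$. Now, using (4) and Lemma 4.3, the
component of $\|\hat u^t\|_2\hat v^{t+1}/\sigma^*$ along $v^*$ is at least

$$\langle u^*,u^t\rangle - \hat\delta_2\sqrt{1-\langle u^*,u^t\rangle^2},$$

where $\hat\delta_2 = \frac{3\delta_2}{1-3\delta_2}$, while its component orthogonal to $v^*$ has norm at most $\hat\delta_2\sqrt{1-\langle u^*,u^t\rangle^2}$. That is,

$$\mathrm{dist}(v^{t+1}, v^*) \le \frac{\hat\delta_2\,\mathrm{dist}(u^t,u^*)}{\langle u^*,u^t\rangle - \hat\delta_2\,\mathrm{dist}(u^t,u^*)}.$$

Hence, assuming $\langle u^*,u^t\rangle \ge 5\hat\delta_2$, $\mathrm{dist}(v^{t+1},v^*) \le \frac{1}{4}\mathrm{dist}(u^t,u^*)$. As $\mathrm{dist}(u^{t+1},u^*)$ and $\mathrm{dist}(v^{t+1},v^*)$
are decreasing with $t$ (from the above bound), we only need to show that $\langle u^0,u^*\rangle \ge 5\hat\delta_2$. Recall
that $\hat u^0$ is obtained by using one step of the SVP algorithm [10]. Hence, using Lemma 2.1 of [10] (see
Lemma A.1):

$$\|\sigma^*(I - u^0(u^0)^\dagger)u^*\|_2^2 \le \|M - \hat u^0(\hat v^0)^\dagger\|_F^2 \le 2\delta_2\|M\|_F^2.$$

Therefore, $\langle u^0,u^*\rangle \ge \sqrt{1-2\delta_2} \ge 5\hat\delta_2$ assuming $\delta_2 \le \frac{1}{100}$. □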



4.2 Rank-$k$ Case

In this section, we present the proof of Theorem 4.2 for arbitrary $k$, i.e., when $M$ is a rank-$k$ matrix
(with SVD $U^*\Sigma^*(V^*)^\dagger$).

Similar to the analysis for the rank-1 case (Section 4.1), we show that even for arbitrary $k$, the
updates of AltMinSense are essentially power-method type updates but with a bounded error term
whose magnitude decreases with each iteration.

However, directly analyzing iterates of AltMinSense is a bit tedious due to non-orthonormality
of intermediate iterates $\hat U$. Instead, for analysis only we consider the iterates of a modified version
of AltMinSense, where we explicitly orthonormalize each iterate using the QR-decomposition (footnote 2). In
particular, suppose we replace steps 4 and 5 of AltMinSense with the following:

$$\hat U^t = U^t R_U^t \quad \text{(QR decomposition)},$$
$$\hat V^{t+1} \leftarrow \arg\min_{\hat V}\|\mathcal{A}(U^t\hat V^\dagger) - b\|_2^2,$$
$$\hat V^{t+1} = V^{t+1}R_V^{t+1} \quad \text{(QR decomposition)},$$
$$\hat U^{t+1} \leftarrow \arg\min_{\hat U}\|\mathcal{A}(\hat U(V^{t+1})^\dagger) - b\|_2^2. \quad (5)$$

^2 The QR decomposition factorizes a matrix into an orthonormal matrix (a basis of its column space) and an upper
triangular matrix; that is, given $\hat S$ it computes $\hat S = SR$ where $S$ has orthonormal columns and $R$ is upper triangular.
If $\hat S$ is full-rank, so are $S$ and $R$.
In our algorithm, in each iterate both $\hat U^t$, $\hat V^t$ remain full-rank because $\mathrm{dist}(\hat U^t, U^*) < 1$; with this,
the following lemma implies that the spaces spanned by the iterates in our AltMinSense algorithm
are exactly the same as the respective ones by the iterates of the above modified version (and hence
the distances $\mathrm{dist}(\hat U^t, U^*)$ and $\mathrm{dist}(\hat V^t, V^*)$ are also the same for the two algorithms).

Lemma 4.4. Let $\hat U^t$ be the $t$-th iterate of our AltMinSense algorithm, and $\tilde U^t$ that of the modified version
stated above. Suppose also that both $\hat U^t$, $\tilde U^t$ are full-rank, and span the same subspace. Then the
same will be true for the subsequent iterates for the two algorithms, i.e. $\mathrm{Span}(\hat V^{t+1}) = \mathrm{Span}(\tilde V^{t+1})$,
$\mathrm{Span}(\hat U^{t+1}) = \mathrm{Span}(\tilde U^{t+1})$, and all matrices at iterate $t+1$ will be full-rank.

The proof of the above lemma can be found in Appendix B.2. In light of this, we will now prove
Theorem 4.2 with the new QR-based iterates (5).
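
The claim that orthonormalizing an iterate does not change the subspaces produced by the next least-squares step can be illustrated numerically. In the sketch below, one $V$-update is computed from an arbitrary full-rank $\hat U$ and from its QR factor $Q$; the two solutions differ only by the invertible factor $R^\dagger$, so they span the same space and give the same fitted product. The helper `v_step`, the Gaussian sensing matrices and the sizes are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k, d = 8, 7, 2, 60
A = rng.standard_normal((d, m, n))                       # sensing matrices A_i
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
b = np.einsum('dmn,mn->d', A, M)                         # b_i = <A_i, M>

def v_step(U):
    """Least-squares update of V for a fixed left factor U."""
    G = np.einsum('dmn,mk->dnk', A, U).reshape(d, n * k)
    return np.linalg.lstsq(G, b, rcond=None)[0].reshape(n, k)

U_hat = rng.standard_normal((m, k))                      # arbitrary full-rank iterate
Q, R = np.linalg.qr(U_hat)                               # U_hat = Q R
V_plain, V_qr = v_step(U_hat), v_step(Q)                 # updates without / with orthonormalization
```

Here `V_qr` equals `V_plain @ R.T`, so the column spans coincide and `U_hat @ V_plain.T` equals `Q @ V_qr.T`.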

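Lemma 4.5. Let $\hat U^t$ be the $t$-th step iterate of AltMinSense and let $U^t$, $\hat V^{t+1}$ and $V^{t+1}$ be obtained
by Update (5). Then,

$$\hat V^{t+1} = \underbrace{V^*\Sigma^*U^{*\dagger}U^t}_{\text{Power-method Update}} - \underbrace{F}_{\text{Error Term}}, \qquad V^{t+1} = \hat V^{t+1}(R^{(t+1)})^{-1}, \quad (6)$$

where $F$ is an error matrix defined in (8) and $R^{(t+1)}$ is a triangular matrix obtained using the QR-
decomposition of $\hat V^{t+1}$.

See Appendix B for a detailed proof of the above lemma.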

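Before we give an expression for the error matrix $F$, we define the following notation. Let
$v^* \in \mathbb{R}^{nk}$ be given by $v^* = \mathrm{vec}(V^*)$, i.e., $v^* = [v_1^{*\dagger}\ v_2^{*\dagger}\ \cdots\ v_k^{*\dagger}]^\dagger$.

Define $B$, $C$, $D$, $S$ as follows:

$$B \stackrel{\mathrm{def}}{=} \begin{bmatrix} B_{11} & \cdots & B_{1k} \\ \vdots & \ddots & \vdots \\ B_{k1} & \cdots & B_{kk} \end{bmatrix}, \quad
C \stackrel{\mathrm{def}}{=} \begin{bmatrix} C_{11} & \cdots & C_{1k} \\ \vdots & \ddots & \vdots \\ C_{k1} & \cdots & C_{kk} \end{bmatrix},$$
$$D \stackrel{\mathrm{def}}{=} \begin{bmatrix} D_{11} & \cdots & D_{1k} \\ \vdots & \ddots & \vdots \\ D_{k1} & \cdots & D_{kk} \end{bmatrix}, \quad
S \stackrel{\mathrm{def}}{=} \begin{bmatrix} \sigma_1^* I_n & \cdots & 0_n \\ \vdots & \ddots & \vdots \\ 0_n & \cdots & \sigma_k^* I_n \end{bmatrix}, \quad (7)$$

where, for $1 \le p, q \le k$: $B_{pq} \stackrel{\mathrm{def}}{=} \sum_{i=1}^d A_i u_p^t u_q^{t\dagger} A_i^\dagger$,
$C_{pq} \stackrel{\mathrm{def}}{=} \sum_{i=1}^d A_i u_p^t u_q^{*\dagger} A_i^\dagger$ and $D_{pq} \stackrel{\mathrm{def}}{=} \langle u_p^t, u_q^*\rangle I_{n\times n}$. Recall that $u_p^t$ is the $p$-th column of $U^t$ and $u_q^*$
is the $q$-th left singular vector of the underlying matrix $M = U^*\Sigma^*(V^*)^\dagger$. Finally, $F$ is obtained by
"de-stacking" the vector $B^{-1}(BD - C)Sv^*$, i.e., the $i$-th column of $F$ is given by:

$$F_i \stackrel{\mathrm{def}}{=} \begin{bmatrix} (B^{-1}(BD - C)Sv^*)_{n(i-1)+1} \\ (B^{-1}(BD - C)Sv^*)_{n(i-1)+2} \\ \vdots \\ (B^{-1}(BD - C)Sv^*)_{ni} \end{bmatrix}. \quad (8)$$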

Note that the notation above should have been $B^t$, $C^t$ and so on. We suppress the dependence on
$t$ for notational simplicity. Now, from Update (6), we have

$$V^{t+1} = \hat V^{t+1}(R^{(t+1)})^{-1} = (V^*\Sigma^*U^{*\dagger}U^t - F)(R^{(t+1)})^{-1} \ \Rightarrow\ V_\perp^{*\dagger}V^{t+1} = -V_\perp^{*\dagger}F(R^{(t+1)})^{-1}, \quad (9)$$

where $V_\perp^*$ is an orthonormal basis of $\mathrm{Span}(v_1^*, v_2^*, \cdots, v_k^*)^\perp$. Therefore,

$$\mathrm{dist}(V^*, V^{t+1}) = \|V_\perp^{*\dagger}V^{t+1}\|_2 = \|V_\perp^{*\dagger}F(R^{(t+1)})^{-1}\|_2 \le \|F(\Sigma^*)^{-1}\|_2\,\|\Sigma^*(R^{(t+1)})^{-1}\|_2.$$

Now, we break down the proof of Theorem 4.2 into the following two steps:

• show that $\|F(\Sigma^*)^{-1}\|_2$ is small (Lemma 4.6) and

• show that $\|\Sigma^*(R^{(t+1)})^{-1}\|_2$ is small (Lemma 4.7).

We will now state the two corresponding lemmas. Complete proofs can be found in Appendix
B.2. The first lemma bounds the spectral norm of $F(\Sigma^*)^{-1}$.

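Lemma 4.6. Let linear measurement $\mathcal{A}$ satisfy RIP for all $2k$-rank matrices and let $b = \mathcal{A}(M)$ with
$M \in \mathbb{R}^{m\times n}$ being a rank-$k$ matrix. Then, the spectral norm of the error matrix $F(\Sigma^*)^{-1}$ (see Equation 6)
after the $t$-th iteration update satisfies:

$$\|F(\Sigma^*)^{-1}\|_2 \le \frac{\delta_{2k}k}{1-\delta_{2k}}\,\mathrm{dist}(U^t, U^*). \quad (10)$$

The following lemma bounds the spectral norm of $\Sigma^*(R^{(t+1)})^{-1}$.

Lemma 4.7. Let linear measurement $\mathcal{A}$ satisfy RIP for all $2k$-rank matrices and let $b = \mathcal{A}(M)$
with $M \in \mathbb{R}^{m\times n}$ being a rank-$k$ matrix. Then,

$$\|\Sigma^*(R^{(t+1)})^{-1}\|_2 \le \frac{\sigma_1^*/\sigma_k^*}{\sqrt{1-\mathrm{dist}^2(U^t,U^*)} - \frac{(\sigma_1^*/\sigma_k^*)\,\delta_{2k}k\,\mathrm{dist}(U^t,U^*)}{1-\delta_{2k}}}. \quad (11)$$

With the above two lemmas, we now prove Theorem 4.2.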



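Proof of Theorem 4.2. Using (9), (10) and (11), we obtain the following:

$$\mathrm{dist}(V^{t+1}, V^*) = \|V_\perp^{*\dagger}F(R^{(t+1)})^{-1}\|_2 \le \|V_\perp^*\|_2\,\|F(\Sigma^*)^{-1}\|_2\,\|\Sigma^*(R^{(t+1)})^{-1}\|_2 \le \frac{(\sigma_1^*/\sigma_k^*)\,\delta_{2k}k\cdot\mathrm{dist}(U^t,U^*)}{(1-\delta_{2k})L}, \quad (12)$$

where $L = \sqrt{1-\mathrm{dist}^2(U^t,U^*)} - \frac{(\sigma_1^*/\sigma_k^*)\,\delta_{2k}k\,\mathrm{dist}(U^t,U^*)}{1-\delta_{2k}}$. Also, note that $U^0$ is obtained using the SVD
of $\sum_i A_ib_i$. Hence, using Lemma A.1, we have:

$$\|\mathcal{A}(U^0\Sigma^0(V^0)^\dagger - U^*\Sigma^*(V^*)^\dagger)\|_2^2 \le 4\delta_{2k}\|\mathcal{A}(U^*\Sigma^*(V^*)^\dagger)\|_2^2,$$
$$\Rightarrow \|U^0\Sigma^0(V^0)^\dagger - U^*\Sigma^*(V^*)^\dagger\|_F^2 \le 4\delta_{2k}(1 + 3\delta_{2k})\|\Sigma^*\|_F^2,$$
$$\Rightarrow \|U^0(U^0)^\dagger U^*\Sigma^*(V^*)^\dagger - U^*\Sigma^*(V^*)^\dagger\|_F^2 \le 6\delta_{2k}\|\Sigma^*\|_F^2,$$
$$\Rightarrow (\sigma_k^*)^2\|(U^0(U^0)^\dagger - I)U^*\|_F^2 \le 6\delta_{2k}k(\sigma_1^*)^2,$$
$$\Rightarrow \mathrm{dist}(U^0, U^*) \le \sqrt{6\delta_{2k}k}\,\frac{\sigma_1^*}{\sigma_k^*} < \frac{1}{2}. \quad (13)$$

Using (12) with $\mathrm{dist}(U^0, U^*) < \frac{1}{2}$ and $\delta_{2k} < \frac{1}{24(\sigma_1^*/\sigma_k^*)^2k}$, we obtain: $\mathrm{dist}(V^{t+1}, V^*) < \frac{1}{4}\mathrm{dist}(U^t, U^*)$.
Similarly we can show that $\mathrm{dist}(U^{t+1}, U^*) < \frac{1}{4}\mathrm{dist}(V^{t+1}, V^*)$. □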
5 Matrix Completion

In this section, we study the Matrix Completion problem (LRMC) and show that, assuming $k$
and $\sigma_1^*/\sigma_k^*$ are constant, AltMinComplete (Algorithm 2) recovers the underlying matrix $M$ using only
$O(n\log n)$ measurements (i.e., we prove Theorem 2.5).

As mentioned, while observing elements in $\Omega$ constitutes a linear map, matrix completion is different from matrix sensing because the map does not satisfy RIP. The (now standard) approach is to assume incoherence of the true matrix $M$, as done in Definition 2.4. With this, and the random sampling of $\Omega$, matrix completion exhibits similarities to matrix sensing. For our analysis, we can again use the fact that incoherence allows us to view alternating minimization as a perturbed power method, whose error we can control.

However, there are important differences between the two problems, which make the analysis of completion more complicated. Chief among them is the fact that we need to establish the incoherence of each iterate. For the initialization $U^0$, this necessitates the "clipping" procedure (described in step 4 of the algorithm). For the subsequent steps, this requires the partitioning of the observed set $\Omega$ into $2T + 1$ sets (as described in step 2 of the algorithm).

As in the case of matrix sensing, we prove our main result for matrix completion (Theorem 2.5) by first establishing a geometric decay of the distance between the subspaces spanned by $U^t, V^t$ and $U^*, V^*$ respectively.

Theorem 5.1. Under the assumptions of Theorem 2.5, the $(t+1)$-th iterates $U^{t+1}$ and $V^{t+1}$ satisfy the following property w.h.p.:

$$\mathrm{dist}(V^{t+1}, V^*) \le \frac{1}{4}\mathrm{dist}(U^t, U^*) \quad \text{and} \quad \mathrm{dist}(U^{t+1}, U^*) \le \frac{1}{4}\mathrm{dist}(V^{t+1}, V^*), \quad \forall\, 1 \le t \le T.$$

We use the above result along with incoherence of $M$ to prove Theorem 2.5. See Appendix C for a detailed proof.
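A geometric decay of this kind translates directly into a logarithmic iteration count. The following is a minimal sketch, assuming a contraction factor of 1/4 per half-step (as in the rank-1 analysis of Section 5.1) and a starting distance of 1/2; the function name and default values are illustrative.

```python
def iterations_to_reach(eps, d0=0.5, factor=0.25):
    """Number of full AltMin iterations (one V-update followed by one
    U-update) until the bound on dist(U^t, U*) drops to eps or below,
    assuming each half-step shrinks the distance by `factor`."""
    d, t = d0, 0
    while d > eps:
        d *= factor * factor  # V-update, then U-update
        t += 1
    return t

steps_for_1e6 = iterations_to_reach(1e-6)
```

Because the distance shrinks by a constant factor per iteration, the count grows like $\log(1/\epsilon)$, matching the $O(\log(1/\epsilon))$ iteration bound used in the analysis.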

Now, similar to the matrix sensing case, alternating minimization needs an initial iterate that is close enough to $U^*$ and $V^*$, from where it will then converge. To this end, Steps 3-4 of Algorithm 2 use the SVD of $P_\Omega(M)$ followed by clipping to initialize $U^0$. While the SVD step guarantees that the resulting iterate is close enough to $U^*$, it might not remain incoherent. To maintain incoherence, we introduce an extra clipping step which guarantees incoherence of $U^0$ while also ensuring that $U^0$ is close enough to $U^*$ (see Lemma 5.2).

Lemma 5.2. Let $M$, $\Omega$, $p$ be as defined in Theorem 2.5. Also, let $U^0$ be the initial iterate obtained by step 4 of Algorithm 2. Then, w.h.p. we have

• $\mathrm{dist}(U^0, U^*) \le \frac{1}{2}$ and

• $U^0$ is incoherent with parameter $4\mu\sqrt{k}$.
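To make the role of clipping concrete, here is a hedged rank-1 sketch. It assumes the usual notion of incoherence for a unit vector (largest entry at most $\mu/\sqrt{n}$) and a clipping rule that zeroes entries above a threshold and renormalizes; the helper names and the threshold value are illustrative assumptions and are not taken from Algorithm 2.

```python
import math

def incoherence(u):
    """Smallest mu with max_i |u_i| <= mu / sqrt(n) for the normalized u."""
    norm = math.sqrt(sum(a * a for a in u))
    return math.sqrt(len(u)) * max(abs(a) for a in u) / norm

def clip_and_normalize(u, threshold):
    """Zero every entry whose magnitude exceeds `threshold`, then rescale
    the result to unit norm (hypothetical clipping rule)."""
    clipped = [a if abs(a) <= threshold else 0.0 for a in u]
    norm = math.sqrt(sum(a * a for a in clipped))
    if norm == 0.0:
        raise ValueError("threshold too small: every entry was clipped")
    return [a / norm for a in clipped]

# A spiky unit vector: one coordinate carries almost all of the mass.
raw = [10.0, 1.0, 1.0, 1.0]
scale = math.sqrt(sum(a * a for a in raw))
u0 = [a / scale for a in raw]
u0_clipped = clip_and_normalize(u0, threshold=0.5)
mu_before, mu_after = incoherence(u0), incoherence(u0_clipped)
```

In this toy example the clipped vector is much flatter than the raw one. The sketch does not show the other half of the lemma, namely that clipping keeps the iterate close to $U^*$.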

The above lemma guarantees a "good" starting point for alternating minimization. Using this, we now present a proof of Theorem 5.1. Similar to the sensing section, we first explain the key ideas of our proof using a rank-1 example. Then in Section 5.2 we extend our proof to general rank-$k$ matrices.
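5.1 Rank-1 Case

Consider the rank-1 matrix completion problem where $M = \sigma^* u^*(v^*)^\dagger$. Now, the $t$-th step iterates $\hat v^{t+1}$ of Algorithm 2 are given by:

$$\hat v^{t+1} = \arg\min_{\hat v} \sum_{(i,j) \in \Omega} (M_{ij} - \hat u^t_i \hat v_j)^2.$$

Let $u^t = \hat u^t / \|\hat u^t\|_2$. Then, $\forall j$:

$$\|\hat u^t\|_2 \, \hat v^{t+1}_j = \frac{\sigma^*}{\sum_{i:(i,j)\in\Omega} (u^t_i)^2} \sum_{i:(i,j)\in\Omega} u^t_i u^*_i v^*_j. \tag{14}$$

Hence,

$$\|\hat u^t\|_2 \, \hat v^{t+1} = \underbrace{\langle u^*, u^t\rangle \sigma^* v^*}_{\text{Power Method}} - \underbrace{\sigma^* B^{-1}(\langle u^*, u^t\rangle B - C)\, v^*}_{\text{Error Term}}, \tag{15}$$

where $B, C \in \mathbb{R}^{n \times n}$ are diagonal matrices, such that

$$B_{jj} = \frac{1}{p}\sum_{i:(i,j)\in\Omega} (u^t_i)^2, \qquad C_{jj} = \frac{1}{p}\sum_{i:(i,j)\in\Omega} u^t_i u^*_i. \tag{16}$$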

Note the similarities between the update (15) and the rank-1 update (4) for the sensing case. Here again, it is essentially a power-method update (first term) along with a bounded error term (see Lemma 5.3). Using this insight, we now prove Theorem 5.1 for the special case of rank-1 matrices. Our proof can be divided into three major steps:

• Base Case: Show that $u^0 = \hat u^0/\|\hat u^0\|_2$ is incoherent and has a small distance to $u^*$ (see Lemma 5.2).

• Induction Step (distance): Assuming $u^t = \hat u^t/\|\hat u^t\|_2$ to be incoherent and that $u^t$ has a small distance to $u^*$, $\hat v^{t+1}$ decreases the distance to $v^*$ by at least a constant factor.

• Induction Step (incoherence): Show incoherence of $\hat v^{t+1}$, while assuming incoherence of $u^t$ (see Lemma 5.4).

We first prove the second step of our proof. To this end, we provide the following lemma that bounds the error term. See Appendix C.2 for a proof of the lemma.

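Lemma 5.3. Let $M$, $p$, $\Omega$, $u^t$ be as defined in Theorem 2.5. Also, let $u^t$ be a unit vector with incoherence parameter $\mu_1$ = ■. Then, w.p. at least 1 — :

$$\|B^{-1}(\langle u^*, u^t\rangle B - C)\, v^*\|_2 \le \frac{\delta_2}{1 - \delta_2}\sqrt{1 - \langle u^t, u^*\rangle^2}.$$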
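Multiplying (15) with $v^*$ and using Lemma 5.3, we get:

$$\|\hat u^t\|_2 \langle \hat v^{t+1}, v^*\rangle \ge \sigma^*\langle u^t, u^*\rangle - 2\sigma^*\delta_2\sqrt{1 - \langle u^t, u^*\rangle^2}, \tag{17}$$

where $\delta_2 < \frac{1}{12}$ is a constant defined in the Theorem statement and is similar to the RIP constant in Section 4.

Similarly, by multiplying (15) with $v^*_\perp$ (where $\langle v^*_\perp, v^*\rangle = 0$ and $\|v^*_\perp\|_2 = 1$) and using Lemma 5.3:

$$\|\hat u^t\|_2 \langle \hat v^{t+1}, v^*_\perp\rangle \le 2\sigma^*\delta_2\sqrt{1 - \langle u^t, u^*\rangle^2}.$$

Using the above two equations:

$$1 - \langle v^{t+1}, v^*\rangle^2 \le \frac{4\delta_2^2\,(1 - \langle u^t, u^*\rangle^2)}{\left(\langle u^t, u^*\rangle - 2\delta_2\sqrt{1 - \langle u^t, u^*\rangle^2}\right)^2 + \left(2\delta_2\sqrt{1 - \langle u^t, u^*\rangle^2}\right)^2}.$$

Assuming $\langle u^t, u^*\rangle \ge 6\delta_2$,

$$\mathrm{dist}(v^{t+1}, v^*) = \sqrt{1 - \langle v^{t+1}, v^*\rangle^2} \le \frac{1}{4}\sqrt{1 - \langle u^t, u^*\rangle^2}.$$

Using the same arguments, we can show that $\mathrm{dist}(u^{t+1}, u^*) \le \mathrm{dist}(v^{t+1}, v^*)/4$. Hence, after $O(\log(1/\epsilon))$ iterations, $\mathrm{dist}(u^t, u^*) \le \epsilon$ and $\mathrm{dist}(v^{t+1}, v^*) \le \epsilon$. This proves our second step.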

We now provide the following lemma to prove the third step. We stress that $\hat v^{t+1}$ does not increase the incoherence parameter ($\mu_1$) when compared to that of $u^t$.

Lemma 5.4. Let $M$, $p$, $\Omega$ be as defined in Theorem 2.5. Also, let $u^t$ be a unit vector with incoherence parameter $\mu_1$ = ■. Then, w.p. at least 1 — , $\hat v^{t+1}$ is also $\mu_1$ incoherent.

See Appendix C.2 for a detailed proof of the lemma.

Finally, for the base case we need that $u^0$ is $\mu_1$ incoherent and also $\langle u^0, u^*\rangle \ge 6\delta_2$. This follows directly by using Lemma 5.2 and the fact that $\delta_2 \le 1/12$.

Note that, to obtain an error of $\epsilon$, AltMinComplete needs to run for $O(\log(\|M\|_F/\epsilon))$ iterations. Also, we need to sample a fresh $\Omega$ at each iteration of AltMinComplete. Hence, the total number of samples needed by AltMinComplete is a factor $O(\log(\|M\|_F/\epsilon))$ larger than the number of samples required per step.
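The coordinate-wise least-squares update of the rank-1 case is easy to simulate. The sketch below is a toy illustration under several assumptions that differ from the analysed algorithm: it uses a fixed deterministic observation pattern (no fresh samples per iteration), an all-ones starting vector instead of the SVD-plus-clipping initialization, and hypothetical helper names and problem sizes.

```python
import math

def normalize(x):
    s = math.sqrt(sum(a * a for a in x))
    return [a / s for a in x]

def dist(u, u_star):
    """Norm of the component of unit vector u orthogonal to unit vector u_star."""
    c = sum(a * b for a, b in zip(u, u_star))
    return math.sqrt(sum((a - c * b) ** 2 for a, b in zip(u, u_star)))

def ls_update(x, entry, observed):
    """Closed-form least-squares update with x held fixed:
    out[o] = sum_i entry(i, o) * x[i] / sum_i x[i]^2 over observed i."""
    out = []
    for outer, inner_idx in enumerate(observed):
        num = sum(entry(inner, outer) * x[inner] for inner in inner_idx)
        den = sum(x[inner] ** 2 for inner in inner_idx)
        out.append(num / den)
    return out

n = 20
u_star = normalize([1 + 0.1 * i for i in range(n)])
v_star = normalize([2 - 0.05 * j for j in range(n)])
sigma = 3.0
M = [[sigma * u_star[i] * v_star[j] for j in range(n)] for i in range(n)]

def is_observed(i, j):
    return (i + j) % 3 != 0  # toy pattern: about two thirds of the entries

cols = [[i for i in range(n) if is_observed(i, j)] for j in range(n)]
rows = [[j for j in range(n) if is_observed(i, j)] for i in range(n)]

u = normalize([1.0] * n)  # crude start that is not orthogonal to u_star
history = []
for t in range(30):
    v = normalize(ls_update(u, lambda i, j: M[i][j], cols))  # v-update
    u = normalize(ls_update(v, lambda j, i: M[i][j], rows))  # u-update
    history.append(dist(u, u_star))
```

The list `history` records the distance to $u^*$ after each full iteration; on this toy instance it decays geometrically, in line with the behaviour the rank-1 analysis predicts.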

5.2 Rank-k case

We now extend our proof of Theorem 5.1 to matrices with arbitrary rank. Here again, we show 
that the AltMinComplete algorithm reduces to power method with bounded perturbation at each 
step. 

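Similar to the matrix sensing case, we analyze the following QR decomposition based update instead of directly analyzing the updates of Algorithm 2:

$$\hat U^t = U^t R_U^t \quad \text{(QR decomposition)},$$
$$\hat V^{t+1} = \arg\min_{\hat V} \|P_\Omega(U^t \hat V^\dagger) - P_\Omega(M)\|_F^2,$$
$$\hat V^{t+1} = V^{t+1} R_V^{t+1} \quad \text{(QR decomposition)},$$
$$\hat U^{t+1} = \arg\min_{\hat U} \|P_\Omega(\hat U (V^{t+1})^\dagger) - P_\Omega(M)\|_F^2. \tag{18}$$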
u 
Here again, we stress that these updates output exactly the same matrices at the end of each iteration; we prefer the QR-based updates for notational ease.

Now, as matrix completion is a special case of matrix sensing, Lemma 4.5 characterizes the updates of the AltMinComplete algorithm (see Algorithm 2). That is,

$$\hat V^{t+1} = \underbrace{V^*\Sigma^*(U^*)^\dagger U^t}_{\text{Power-method Update}} - \underbrace{F}_{\text{Error Term}}, \qquad V^{t+1} = \hat V^{t+1}(R^{(t+1)})^{-1}, \tag{19}$$

where $F$ is the error matrix defined in (8) and $R^{(t+1)}$ is an upper-triangular matrix obtained using the QR decomposition of $\hat V^{t+1}$. See (7) for the definition of $B$, $C$, $D$, and $S$.

Also, note that for the special case of matrix completion, Bpg,Cpq,l < p,q < k are diagonal 
matrices with 

We use this structure to further simplify the update equation. We first define matrices , S 
^fcxfc^ 1 < i < n: 

=- V (^7*)«(c/*)(*)^ =- V (c/*)W(c/*)(')^ 

and = {U^)'^U*. Using the above notation, (19) decouples into n equations of the form 
(1 <i <n): 

(y*+i)0-) = (y*)(i)(Di - {b^)-\bW^ - C7J'))(ii(*+i))-\ (20) 

where $(V^{t+1})^{(j)}$ and $(V^*)^{(j)}$ denote the $j$-th rows of $V^{t+1}$ and $V^*$ respectively.

Using the above notation, we now provide a proof of Theorem 5.1 for the general rank-$k$ case.

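Proof of Theorem 5.1. Multiplying the update equation (19) on the left by $(V^*_\perp)^\dagger$, we get: $(V^*_\perp)^\dagger V^{t+1} = -(V^*_\perp)^\dagger F (R^{(t+1)})^{-1}$. That is,

$$\mathrm{dist}(V^*, V^{t+1}) = \|(V^*_\perp)^\dagger V^{t+1}\|_2 = \|(V^*_\perp)^\dagger F (R^{(t+1)})^{-1}\|_2 \le \|F(\Sigma^*)^{-1}\|_2 \, \|\Sigma^*(R^{(t+1)})^{-1}\|_2.$$

Now, similar to the sensing case (see Section 4.2) we break down our proof into the following two steps:

• Bound $\|F(\Sigma^*)^{-1}\|_2$ (Lemma 5.6) and

• Bound $\|(R^{(t+1)})^{-1}\|_2$, i.e., the minimum singular value of $R^{(t+1)}$ (Lemma 5.7).

Using Lemma 5.6 and Lemma 5.7, w.h.p.,

$$\mathrm{dist}(V^*, V^{t+1}) \le \|F(\Sigma^*)^{-1}\|_2 \, \|\Sigma^*(R^{(t+1)})^{-1}\|_2 \le \frac{(\sigma_1^*/\sigma_k^*)\, k\, \frac{\delta_{2k}}{1 - \delta_{2k}} \cdot \mathrm{dist}(U^t, U^*)}{\sqrt{1 - \mathrm{dist}(U^t, U^*)^2} - \frac{(\sigma_1^*/\sigma_k^*)\, k\, \delta_{2k}\, \mathrm{dist}(U^t, U^*)}{1 - \delta_{2k}}}.$$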
Now, using Lemma 5.2 we get: $\mathrm{dist}(U^t, U^*) \le \mathrm{dist}(U^0, U^*) \le \frac{1}{2}$. By selecting $\delta_{2k} < \frac{\sigma_k^*}{12 k \sigma_1^*}$ (which is ensured by the sampling probability $p$ assumed in Theorem 2.5) and using the above two inequalities:

$$\mathrm{dist}(V^{t+1}, V^*) \le \frac{1}{4}\mathrm{dist}(U^t, U^*).$$

Furthermore, using Lemma 5.5 we get that $V^{t+1}$ is incoherent. Hence, using similar arguments as above, we also get: $\mathrm{dist}(U^{t+1}, U^*) \le \frac{1}{4}\mathrm{dist}(V^{t+1}, V^*)$. □

We now provide the lemmas required by the above proof. See Appendix C.3 for a detailed proof of each of the lemmas.

We first provide a lemma to bound the incoherence of $V^{t+1}$, assuming incoherence of $U^t$.

Lemma 5.5. Let $M$, $\Omega$, $p$ be as defined in Theorem 2.5. Also, let $U^t$ be the $t$-th step iterate obtained by (18). Let $U^t$ be $\mu_1$ = ■ incoherent. Then, w.p. at least 1 — , iterate $V^{t+1}$ is also $\mu_1$ incoherent.

We now bound the error term ($F$) in the AltMin update (19).

Lemma 5.6. Let $F$ be the error matrix defined by (8) (also see (19)) and let $U^t$ be a $\mu_1$-incoherent orthonormal matrix obtained after the $(t-1)$-th update. Also, let $M$, $\Omega$, and $p$ satisfy the assumptions of Theorem 2.5. Then, w.p. at least 1 — :

$$\|F(\Sigma^*)^{-1}\|_2 \le \frac{\delta_{2k}\, k}{1 - \delta_{2k}}\, \mathrm{dist}(U^t, U^*).$$

Next, we present a lemma to bound $\|(R^{(t+1)})^{-1}\|_2$ (through $\|\Sigma^*(R^{(t+1)})^{-1}\|_2$).

Lemma 5.7. Let $R^{(t+1)}$ be the upper-triangular matrix obtained by QR decomposition of $\hat V^{t+1}$ (see (19)) and let $U^t$ be a $\mu_1$-incoherent orthonormal matrix obtained after the $(t-1)$-th update. Also, let $M$ and $\Omega$ satisfy the assumptions of Theorem 2.5. Then,

$$\|\Sigma^*(R^{(t+1)})^{-1}\|_2 \le \frac{\sigma_1^*/\sigma_k^*}{\sqrt{1 - \mathrm{dist}^2(U^t, U^*)} - \frac{(\sigma_1^*/\sigma_k^*)\,\delta_{2k}\, k\, \mathrm{dist}(U^t, U^*)}{1 - \delta_{2k}}}.$$

Proof. The lemma follows by exactly the same proof as that of Lemma 4.7 for the matrix sensing case. □



6 Stagewise AltMin Algorithm 

In Section 4, we showed that if 62k < J^^Yk t^ien AltMinSense (Algorithm 1) recovers the underlying 

matrix. This means that, d = -^^tjiA;^?! log n random Gaussian measurements (assume m < n) are 

required to recover $M$. For matrices with large condition number ($\sigma_1^*/\sigma_k^*$), this would be significantly larger than the information theoretic bound of $O(kn\log(n/k))$ measurements.

To alleviate this problem, we present a modified version of AltMinSense called Stage-AltMin. Stage-AltMin proceeds in $k$ stages where in the $i$-th stage, a rank-$i$ problem is solved. The goal of the $i$-th stage is to recover the top $i$ singular vectors of $M$, up to $O(\sigma_{i+1}^*)$ error.
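Algorithm 3 Stage-AltMin: Stagewise Alternating Minimization for Matrix Sensing

1: Input: $b$, $\mathcal{A}$
3: for $i = 1, \ldots, k$ do
4:   $[\hat U^0_{1:i}, \hat V^0_{1:i}]$ = top $i$ singular vectors of the matrix obtained from $\hat U^T_{1:i-1}(\hat V^T_{1:i-1})^\dagger$ by one step of SVP [10]
5:   for $t = 0, \ldots, T - 1$ do
6:     $\hat V^{t+1}_{1:i} \leftarrow \arg\min_{V \in \mathbb{R}^{n \times i}} \|\mathcal{A}(\hat U^t_{1:i} V^\dagger) - b\|_2^2$
7:     $\hat U^{t+1}_{1:i} \leftarrow \arg\min_{U \in \mathbb{R}^{m \times i}} \|\mathcal{A}(U (\hat V^{t+1}_{1:i})^\dagger) - b\|_2^2$
8:   end for
9: end for
10: Output: $X = \hat U^T_{1:k}(\hat V^T_{1:k})^\dagger$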



Specifically, we initialize the $i$-th stage of our algorithm using one step of the SVP algorithm [10] (see Step 4 of Algorithm 3). We then show that, if $\delta_{2k}$ is small enough (the proofs in Appendix B.4 use $\delta_{2k} \le \frac{1}{3200k}$), then Stage-AltMin (Steps 6, 7 of Algorithm 3) decreases the error $\|M - \hat U^T_{1:i}(\hat V^T_{1:i})^\dagger\|_F$ to $O(\sigma_{i+1}^*)$. Hence, after $k$ stages, the error decreases to $O(\sigma_{k+1}^*) = 0$. Note that $\hat U^t_{1:i} \in \mathbb{R}^{m \times i}$ represents the $t$-th step iterate ($\hat U$) in the $i$-th stage; $\hat V^t_{1:i} \in \mathbb{R}^{n \times i}$ is defined similarly.

Recall that the main problem with our analysis of AltMinSense is that if $\sigma_i^* \gg \sigma_{i+1}^*$ (for some $i$) then the RIP constant $\delta_{2k}$ needs to be very small. However, in such a scenario, the $i$-th stage of Algorithm 3 can be thought of as solving a noisy sensing problem where the goal is to recover $M_i \stackrel{\mathrm{def}}{=} U^*_{1:i}\Sigma^*_{1:i}(V^*_{1:i})^\dagger$ using noisy measurements $b = \mathcal{A}(U^*_{1:i}\Sigma^*_{1:i}(V^*_{1:i})^\dagger + N)$, where the noise matrix is $N \stackrel{\mathrm{def}}{=} U^*_{i+1:k}\Sigma^*_{i+1:k}(V^*_{i+1:k})^\dagger$. Here $M_i$ and $N$ represent the top $i$ singular components and the last $k - i$ singular components of $M$ respectively. Hence, using a noisy-case type analysis (see Section B.3) we show that the error $\|M - \hat U^T_{1:i}(\hat V^T_{1:i})^\dagger\|_F$ decreases to $O(\sigma_{i+1}^*)$.

We now formally present the proof of our main result (see Theorem 2.3). 

Proof of Theorem 2.3. We prove the theorem using mathematical induction.

Base Case: After the 0-th stage, the error is $\|M\|_F^2 \le \sum_{j=1}^{k} (\sigma_j^*)^2 \le k(\sigma_1^*)^2$. Hence, the base case holds.

Induction Step: Here, assuming that the error bound holds for the $(i-1)$-th stage, we prove the error bound for the $i$-th stage.

Our proof proceeds in two steps. First, we show that the initial point $\hat U^0_{1:i}, \hat V^0_{1:i}$ of the $i$-th stage, obtained using Step 4, has $c(\sigma_i^*)^2 + O(k(\sigma_{i+1}^*)^2)$ error, with $c < 1$. In the second step, we show that starting from the initial points $\hat U^0_{1:i}, \hat V^0_{1:i}$, the AltMin iterations in the $i$-th stage (Steps 6, 7) reduce the error to $\max(\epsilon, 16k(\sigma_{i+1}^*)^2)$.

We formalize the first step in Lemma 6.1 and then prove the second step in Lemma 6.2. □

We now present two lemmas used by the above given proof. See Appendix B.4 for a proof of 
each of the lemmas. 

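Lemma 6.1. Let the assumptions of Theorem 2.3 be satisfied. Also, let $\hat U^0_{1:i}, \hat V^0_{1:i}$ be the output of Step 4 of Algorithm 3. Then, assuming that $\|M - \hat U^T_{1:i-1}(\hat V^T_{1:i-1})^\dagger\|_F^2 \le 16k(\sigma_i^*)^2$, we obtain:

$$\|M - \hat U^0_{1:i}(\hat V^0_{1:i})^\dagger\|_F^2 \le \sum_{j=i+1}^{k} (\sigma_j^*)^2 + \frac{1}{100}(\sigma_i^*)^2.$$

Lemma 6.2. Let the assumptions of Theorem 2.3 be satisfied. Also, let $\hat U^T_{1:i}, \hat V^T_{1:i}$ be the $T$-th step iterates of the $i$-th stage of Algorithm 3. Then, assuming that $\|M - \hat U^0_{1:i}(\hat V^0_{1:i})^\dagger\|_F^2 \le \sum_{j=i+1}^{k}(\sigma_j^*)^2 + \frac{1}{100}(\sigma_i^*)^2$, we obtain:

$$\|M - \hat U^T_{1:i}(\hat V^T_{1:i})^\dagger\|_F^2 \le \max(\epsilon, 16k(\sigma_{i+1}^*)^2),$$

where $T = \Omega(\log(\|M\|_F/\epsilon))$.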

7 Summary and Discussion 

Alternating minimization provides an empirically appealing and popular approach to solving sev- 
eral different low-rank matrix recovery problems. The main motivation, and result, of this paper 
was to provide the first theoretical guarantees on the global optimality of alternating 
minimization, for matrix completion and the related problem of matrix sensing. We would like 
to note the following aspects of our results and proofs: 

• For both the problems, we show that alternating minimization recovers the true matrix under 
similar problem conditions (RIP, incoherence) to those used by existing algorithms (based on 
convex optimization or iterated SVDs); computationally, our results show faster convergence 
to the global optima, but with possibly higher statistical (i.e. sample) complexity. 

• We develop a new framework for analyzing alternating minimization for low-rank problems. The key observation of our framework is that for some problems (under standard problem conditions) alternating minimization can be viewed as a perturbed version of the power method.
In our case, we can control the perturbation error based on the extent of RIP / incoherence 
demonstrated by the problem. This idea is likely to have applications to other similar prob- 
lems where trace-norm based convex relaxation techniques have rigorous theoretical results 
but alternating minimization has enjoyed more empirical success. For example, robust PCA 
[6, 2], spectral clustering [11] etc. 

• Our analysis also sheds light on two key aspects of the alternating minimization approach: 
Initialization: Due to its connection to power method, it is now easy to see that for al- 
ternating minimization to succeed, the initial iterate should not be orthogonal to the target 
vector. Our results indeed show that alternating minimization succeeds if the initial iterate 
is not "almost orthogonal" to the target subspace. This suggests that, selecting initial iterate 
smartly is preferable to random initialization. 

Dependence on the condition number: Our results for the alternating minimization 
algorithm depend on the condition number. However, using a stagewise adaptation of alter- 
nating minimization, we can remove this dependence for the matrix sensing problem. This 
suggests that (problem specific) modifications of the basic alternating minimization algorithm 
may in fact perform better than the original one, while (mostly) retaining the computational 
/ implementational simplicity of the underlying method. 



< max(e, 16k{a*^;^f), 
A Preliminaries

Lemma A.1 (Lemma 2.1 of [10]). Let $b = \mathcal{A}(M) + e$, where $e$ is a bounded error vector, $M$ is a rank-$k$ matrix and $\mathcal{A}$ is a linear measurement operator that satisfies $2k$-RIP with constant $\delta_{2k}$ (with $\delta_{2k}$ small enough for the guarantee of [10] to apply). Let $X^{t+1}$ be the $(t+1)$-th step iterate of SVP, then the following holds:

$$\|\mathcal{A}(X^{t+1}) - b\|_2^2 \le \|\mathcal{A}(M) - b\|_2^2 + 2\delta_{2k}\|\mathcal{A}(X^t) - b\|_2^2.$$

In our analysis, we heavily use the following two results. The first result is the well-known 
Bernstein's inequality. 

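Lemma A.2. [Bernstein's inequality] Let $X_1, X_2, \ldots, X_n$ be independent random variables. Also, let $|X_i| \le L \in \mathbb{R}$ $\forall i$ w.p. 1. Then, we have the following inequality:

$$P\left[\,\left|\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \mathbb{E}[X_i]\right| > t\,\right] \le 2\exp\left(\frac{-t^2/2}{\sum_{i=1}^{n}\mathrm{Var}(X_i) + Lt/3}\right). \tag{22}$$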
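For a feel of the scale of the bound in (22), the following sketch simply evaluates its right-hand side. The function name and the example (100 bounded variables of unit variance) are illustrative choices and do not come from the paper.

```python
import math

def bernstein_bound(t, variances, L):
    """Right-hand side of (22): 2 * exp(-(t^2 / 2) / (sum Var(X_i) + L * t / 3))."""
    return 2.0 * math.exp(-(t * t / 2.0) / (sum(variances) + L * t / 3.0))

# Example: 100 independent variables with |X_i| <= 1 and unit variance.
tail_30 = bernstein_bound(30.0, [1.0] * 100, 1.0)
tail_40 = bernstein_bound(40.0, [1.0] * 100, 1.0)
```

The bound decays roughly like a Gaussian tail while $Lt/3$ is small compared with the total variance, and like an exponential tail once $Lt/3$ dominates.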

The second result is a restatement of Theorem 3.1 from [12]. 

Theorem A.3. (Restatement of Theorem 3.1 from [12]) Suppose $M$ is an incoherent rank-$k$ matrix and let $p$, $\Omega$ be as in Theorem 2.5. Further, let $M_k$ be the best rank-$k$ approximation of $\frac{1}{p}P_\Omega(M)$. Then, w.h.p. we have:



|M-Mfc||2 < 



p^mn 



\M\ 



(23) 



Remark: Note that Theorem 3.1 from [12] holds only for $T_r(P_\Omega(M))$, where $T_r(P_\Omega(M))$ is a trimmed version of $P_\Omega(M)$ obtained by setting all rows and columns of $P_\Omega(M)$ with too many observed entries to zero. However, using a standard Chernoff bound we can argue that for our choice of $p$, none of the rows and columns of $P_\Omega(M)$ have too many observed entries and hence $T_r(P_\Omega(M)) = P_\Omega(M)$, w.h.p.



B Matrix Sensing: Proofs 

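The following is an alternate characterization of RIP that we use heavily in our proofs. At a conceptual level, it says that if $\mathcal{A}$ satisfies RIP, then it also preserves the inner product between any two rank-$k$ matrices (up to some additive error).

Lemma B.1. Suppose $\mathcal{A}(\cdot)$ satisfies $2k$-RIP with constant $\delta_{2k}$. Then, for any $U_1, U_2 \in \mathbb{R}^{m \times k}$ and $V_1, V_2 \in \mathbb{R}^{n \times k}$, we have the following:

$$\left|\langle \mathcal{A}(U_1V_1^\dagger), \mathcal{A}(U_2V_2^\dagger)\rangle - \mathrm{Tr}(U_2^\dagger U_1 V_1^\dagger V_2)\right| \le 3\delta_{2k}\, \|U_1V_1^\dagger\|_F \, \|U_2V_2^\dagger\|_F. \tag{24}$$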



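Proof. Consider the matrices $X_1 = U_1V_1^\dagger$, $X_2 = U_2V_2^\dagger$ and $X = X_1 + X_2$. Since the rank of $X$ is at most $2k$, we obtain the following using the RIP of $\mathcal{A}$ (writing $\delta$ for $\delta_{2k}$):

$$(1 - \delta)\,\|U_1V_1^\dagger + U_2V_2^\dagger\|_F^2 \le \|\mathcal{A}(X)\|_2^2 \le (1 + \delta)\,\|U_1V_1^\dagger + U_2V_2^\dagger\|_F^2.$$

Concentrating on the second inequality, we obtain

$$\sum_i \left(\mathrm{Tr}(A_iU_1V_1^\dagger) + \mathrm{Tr}(A_iU_2V_2^\dagger)\right)^2 \le (1 + \delta)\left(\|U_1V_1^\dagger\|_F^2 + \|U_2V_2^\dagger\|_F^2 + 2\,\mathrm{Tr}(U_1V_1^\dagger V_2U_2^\dagger)\right)$$
$$\stackrel{(\zeta_1)}{\Rightarrow} \sum_i \mathrm{Tr}(A_iU_1V_1^\dagger)\,\mathrm{Tr}(A_iU_2V_2^\dagger) - \mathrm{Tr}(U_1V_1^\dagger V_2U_2^\dagger) \le \delta\left(\|U_1V_1^\dagger\|_F^2 + \|U_2V_2^\dagger\|_F^2 + \mathrm{Tr}(U_1V_1^\dagger V_2U_2^\dagger)\right)$$
$$\stackrel{(\zeta_2)}{\Rightarrow} \sum_i \mathrm{Tr}(A_iU_1V_1^\dagger)\,\mathrm{Tr}(A_iU_2V_2^\dagger) - \mathrm{Tr}(U_1V_1^\dagger V_2U_2^\dagger) \le \delta\left(\|U_1V_1^\dagger\|_F^2 + \|U_2V_2^\dagger\|_F^2 + \|U_1V_1^\dagger\|_F\|U_2V_2^\dagger\|_F\right), \tag{25}$$

where $(\zeta_1)$ follows from the fact that $X_1$ and $X_2$ are rank-$k$ matrices and hence $\mathcal{A}(\cdot)$ satisfies RIP w.r.t. those matrices, and $(\zeta_2)$ follows from the fact that $\mathrm{Tr}(U_1V_1^\dagger V_2U_2^\dagger) \le \|U_1V_1^\dagger\|_F\|U_2V_2^\dagger\|_F$. Note that if we replace $U_1V_1^\dagger$ by $\lambda U_1V_1^\dagger$ and $U_2V_2^\dagger$ by $\frac{1}{\lambda}U_2V_2^\dagger$ in (25) for some non-zero $\lambda \in \mathbb{R}$, the LHS of (25) does not change whereas the RHS of (25) changes. Optimizing the RHS w.r.t. $\lambda$, we obtain

$$\sum_i \mathrm{Tr}(A_iU_1V_1^\dagger)\,\mathrm{Tr}(A_iU_2V_2^\dagger) - \mathrm{Tr}(U_2^\dagger U_1V_1^\dagger V_2) \le 3\delta\, \|U_1V_1^\dagger\|_F\, \|U_2V_2^\dagger\|_F.$$

A similar argument proves the other side of the inequality. This proves the lemma. □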
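The optimization over $\lambda$ in this proof can be written out explicitly. The following worked step is a sketch using the shorthand $a = \|U_1V_1^\dagger\|_F$ and $b = \|U_2V_2^\dagger\|_F$ (both assumed non-zero); the shorthand is introduced here only for illustration.

```latex
% After rescaling, the right-hand side of (25) becomes a function of \lambda:
\[
  g(\lambda) \;=\; \delta\left(\lambda^2 a^2 + \frac{b^2}{\lambda^2} + ab\right).
\]
% By the AM-GM inequality, \lambda^2 a^2 + b^2/\lambda^2 \ge 2ab,
% with equality exactly when \lambda^2 = b/a. Hence
\[
  \min_{\lambda \neq 0} g(\lambda) \;=\; \delta\,(2ab + ab) \;=\; 3\,\delta\, a\, b ,
\]
% which is the constant 3\delta appearing in (24).
```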
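Proof of Lemma 4.5. We first show that the update (6) reduces to:

$$\sum_{q=1}^{k}\left(\sum_{i=1}^{d} A_i u_p^t (u_q^t)^\dagger A_i^\dagger\right)\hat v_q^{t+1} = \sum_{q=1}^{k} \sigma_q^* \left(\sum_{i=1}^{d} A_i u_p^t (u_q^*)^\dagger A_i^\dagger\right) v_q^*, \quad \forall\, p \in [k]. \tag{26}$$

Let $\mathrm{Err}(\hat V) \stackrel{\mathrm{def}}{=} \sum_i \left(\mathrm{Tr}(A_iM) - \mathrm{Tr}(A_iU^t\hat V^\dagger)\right)^2$. Since $\hat V^{t+1}$ minimizes $\mathrm{Err}(\hat V)$, we have $\nabla_{\hat V}\mathrm{Err}(\hat V^{t+1}) = 0$. Writing $\mathrm{Tr}(A_iU^t\hat V^\dagger) = \sum_{l=1}^{k} \hat v_l^\dagger A_i u_l^t$ and $\mathrm{Tr}(A_iM) = \sum_{l=1}^{k} \sigma_l^* (v_l^*)^\dagger A_i u_l^*$, the gradient with respect to the $p$-th column gives

$$\sum_{i=1}^{d}\left(\sum_{l=1}^{k} (\hat v_l^{t+1})^\dagger A_i u_l^t\right) A_i u_p^t = \sum_{i=1}^{d}\left(\sum_{l=1}^{k} \sigma_l^* (v_l^*)^\dagger A_i u_l^*\right) A_i u_p^t,$$
$$\Rightarrow \sum_{l=1}^{k}\left(\sum_{i=1}^{d} A_i u_p^t (u_l^t)^\dagger A_i^\dagger\right)\hat v_l^{t+1} = \sum_{l=1}^{k} \sigma_l^*\left(\sum_{i=1}^{d} A_i u_p^t (u_l^*)^\dagger A_i^\dagger\right) v_l^*,$$

which is (26). Define the block matrices $B = [B_{pq}]$, $C = [C_{pq}]$, $D = [D_{pq}]$ ($1 \le p, q \le k$) with

$$B_{pq} = \sum_{i=1}^{d} A_i u_p^t (u_q^t)^\dagger A_i^\dagger, \qquad C_{pq} = \sum_{i=1}^{d} A_i u_p^t (u_q^*)^\dagger A_i^\dagger, \qquad D_{pq} = \langle u_p^t, u_q^*\rangle\, I_{n \times n},$$

together with

$$S = \begin{bmatrix} \sigma_1^* I_n & \cdots & 0_n \\ \vdots & \ddots & \vdots \\ 0_n & \cdots & \sigma_k^* I_n \end{bmatrix}, \qquad v^* = \begin{bmatrix} v_1^* \\ \vdots \\ v_k^* \end{bmatrix}, \qquad \text{and} \qquad \hat v^{t+1} = \begin{bmatrix} \hat v_1^{t+1} \\ \vdots \\ \hat v_k^{t+1} \end{bmatrix}.$$

Then,

$$\hat v^{t+1} = B^{-1}CSv^* = DSv^* - B^{-1}(BD - C)Sv^*,$$

where inverting $B$ is valid since the minimum singular value of $B$ is strictly positive (please refer to Lemma B.2). Considering the $p$-th block of $\hat v^{t+1}$, we obtain

$$\hat v_p^{t+1} = \left(\sum_{q=1}^{k} \sigma_q^* \langle u_p^t, u_q^*\rangle v_q^*\right) - \left(B^{-1}(BD - C)Sv^*\right)_p.$$

This gives us the following equation for $\hat V^{t+1}$:

$$\hat V^{t+1} = V^*\Sigma^*(U^*)^\dagger U^t - F,$$

where $F = \left[\, (B^{-1}(BD - C)Sv^*)_1 \;\; (B^{-1}(BD - C)Sv^*)_2 \;\cdots\; (B^{-1}(BD - C)Sv^*)_k \,\right]$.

Hence proved. □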

B.l Rank-1 Matrix Sensing: Proofs 

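Proof of Lemma 4.3. Using the definition of the spectral norm:

$$\|B^{-1}(\langle u^*, u^t\rangle B - C)v^*\|_2 \le \|B^{-1}\|_2 \cdot \|\langle u^*, u^t\rangle B - C\|_2 \cdot \|v^*\|_2. \tag{27}$$

Consider $B = \sum_i A_i u^t (u^t)^\dagger A_i^\dagger$. Now, the smallest eigenvalue of $B$, i.e., $\lambda_{\min}(B)$, is given by:

$$\lambda_{\min}(B) = \min_{z, \|z\|_2 = 1} z^\dagger B z = \min_{z, \|z\|_2 = 1} \sum_i z^\dagger A_i u^t (u^t)^\dagger A_i^\dagger z = \min_{z, \|z\|_2 = 1} \sum_i \mathrm{Tr}(A_i u^t z^\dagger)\,\mathrm{Tr}(A_i u^t z^\dagger)$$
$$= \min_{z, \|z\|_2 = 1} \langle \mathcal{A}(u^t z^\dagger), \mathcal{A}(u^t z^\dagger)\rangle \ge 1 - 3\delta_2, \tag{28}$$

where the last inequality follows using Lemma B.1. Using (28),

$$\|B^{-1}\|_2 \le \frac{1}{1 - 3\delta_2}. \tag{29}$$

Now, consider $G = \langle u^*, u^t\rangle B - C = \sum_i A_i\left(\langle u^*, u^t\rangle u^t (u^t)^\dagger - u^t (u^*)^\dagger\right)A_i^\dagger = \sum_i A_i u^t \left(\langle u^*, u^t\rangle u^t - u^*\right)^\dagger A_i^\dagger$. Using the definition of the spectral norm:

$$\|G\|_2 = \max_{\|z\|_2 = 1, \|y\|_2 = 1} z^\dagger G y = \max_{\|z\|_2 = 1, \|y\|_2 = 1} \sum_i z^\dagger A_i u^t \left(\langle u^*, u^t\rangle u^t - u^*\right)^\dagger A_i^\dagger y$$
$$= \max_{\|z\|_2 = 1, \|y\|_2 = 1} \left\langle \mathcal{A}(u^t z^\dagger),\, \mathcal{A}\left((\langle u^*, u^t\rangle u^t - u^*)\, y^\dagger\right)\right\rangle \le 3\delta_2\sqrt{1 - \langle u^t, u^*\rangle^2}, \tag{30}$$

where the last inequality follows by using Lemma B.1 and the fact that $\langle u^t, (\langle u^*, u^t\rangle u^t - u^*)\rangle = 0$. The lemma now follows using (27), (29), (30). □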
B.2 Rank-k Matrix Sensing

Proof of Lemma 4-4- Since [/* and C/* have full rank and span the same subspace, there exists a 



k X k, full rank matrix R such that U = U R = U R\jR. We have: 



A ( f/Vn - h 



> 



with equality holding in the last step for V = . The proof of Theorem 4.2 shows 

that is unique and has full rank (since dist (y*^"*^,!^*) < 1). This means that is also 



unique and is equal to V^'^^ ^(fij^i?)^^ . This shows that Span ^F*"*^^^ = Span ^y*^"*^^ and that 
both and have full rank. □ 

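Lemma B.2. Let the linear measurement $\mathcal{A}$ satisfy RIP for all $2k$-rank matrices and let $b = \mathcal{A}(M)$ with $M \in \mathbb{R}^{m \times n}$ being a rank-$k$ matrix. Let $\delta_{2k}$ be the RIP constant for rank-$2k$ matrices. Then, we have the following bound on the minimum singular value of $B$:

$$\sigma_{\min}(B) \ge 1 - \delta_{2k}. \tag{31}$$

Proof. Select any $w \in \mathbb{R}^{nk}$ such that $\|w\|_2 = 1$. Let

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix},$$

where each $w_p \in \mathbb{R}^n$. Also denote $W \stackrel{\mathrm{def}}{=} [w_1\, w_2 \cdots w_k] \in \mathbb{R}^{n \times k}$, i.e., $w = \mathrm{vec}(W)$. We have,

$$w^\dagger B w = \sum_{p,q=1}^{k} w_p^\dagger B_{pq} w_q = \sum_{p,q=1}^{k} w_p^\dagger\left(\sum_{i=1}^{d} A_i u_p^t (u_q^t)^\dagger A_i^\dagger\right) w_q = \sum_{i=1}^{d}\sum_{p,q=1}^{k} w_p^\dagger A_i u_p^t \,(u_q^t)^\dagger A_i^\dagger w_q$$
$$= \sum_{i=1}^{d}\left(\sum_{p=1}^{k} \mathrm{Tr}(A_i u_p^t w_p^\dagger)\right)\left(\sum_{q=1}^{k} \mathrm{Tr}(A_i u_q^t w_q^\dagger)\right) = \sum_{i=1}^{d} \mathrm{Tr}(A_i U^t W^\dagger)^2.$$

Now, using RIP (see Definition 2.1) along with the above equation, we get:

$$w^\dagger B w = \sum_{i=1}^{d} \mathrm{Tr}(A_i U^t W^\dagger)^2 \ge (1 - \delta_{2k})\,\|U^tW^\dagger\|_F^2 = (1 - \delta_{2k})\,\|W\|_F^2 = (1 - \delta_{2k})\,\|w\|_2^2 = (1 - \delta_{2k}).$$

Since $w$ was arbitrary, this proves the lemma. □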
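Proof of Lemma 4.6. Note that,

$$\|F(\Sigma^*)^{-1}\|_2 \le \|F(\Sigma^*)^{-1}\|_F = \|B^{-1}(BD - C)v^*\|_2 \le \|B^{-1}\|_2\, \|BD - C\|_2\, \|v^*\|_2 \le \frac{\sqrt{k}}{1 - \delta_{2k}}\|BD - C\|_2, \tag{32}$$

where the last step follows from Lemma B.2 and $\|v^*\|_2 = \sqrt{k}$. Now we need to bound $\|BD - C\|_2$. Choose any $w, z \in \mathbb{R}^{nk}$ such that $\|w\|_2 = \|z\|_2 = 1$. As in Lemma B.2, define the following components of $w$ and $z$:

$$w = \begin{bmatrix} w_1 \\ \vdots \\ w_k \end{bmatrix} \quad \text{and} \quad z = \begin{bmatrix} z_1 \\ \vdots \\ z_k \end{bmatrix},$$

where each $w_p, z_p \in \mathbb{R}^n$, and $W \stackrel{\mathrm{def}}{=} [w_1\, w_2 \cdots w_k]$ and $Z \stackrel{\mathrm{def}}{=} [z_1\, z_2 \cdots z_k] \in \mathbb{R}^{n \times k}$. We have

$$w^\dagger(BD - C)z = \sum_{p,q=1}^{k} w_p^\dagger (BD - C)_{pq}\, z_q.$$

We calculate $(BD - C)_{pq}$ as follows:

$$(BD - C)_{pq} = \sum_{l=1}^{k} B_{pl}D_{lq} - C_{pq} = \sum_{i=1}^{d} A_i u_p^t \left(\sum_{l=1}^{k}\langle u_l^t, u_q^*\rangle u_l^t - u_q^*\right)^\dagger A_i^\dagger = \sum_{i=1}^{d} A_i u_p^t (u_q^*)^\dagger\left(U^t(U^t)^\dagger - I_{m \times m}\right)A_i^\dagger.$$

So we have,

$$w^\dagger(BD - C)z = \sum_{i=1}^{d}\sum_{p,q=1}^{k} w_p^\dagger A_i u_p^t (u_q^*)^\dagger\left(U^t(U^t)^\dagger - I\right)A_i^\dagger z_q = \sum_{i=1}^{d} \mathrm{Tr}(A_iU^tW^\dagger)\,\mathrm{Tr}\left(A_i\left(U^t(U^t)^\dagger - I\right)U^*Z^\dagger\right)$$
$$\stackrel{(\zeta_1)}{\le} \mathrm{Tr}\left((U^*)^\dagger\left(U^t(U^t)^\dagger - I\right)U^tW^\dagger Z\right) + \delta_{2k}\,\|U^tW^\dagger\|_F\,\|\left(U^t(U^t)^\dagger - I\right)U^*Z^\dagger\|_F$$
$$\stackrel{(\zeta_2)}{\le} \delta_{2k}\,\|W\|_F\sqrt{\|(U^*)^\dagger\left(U^t(U^t)^\dagger - I\right)^2U^*\|_F\,\|Z^\dagger Z\|_F}$$
$$\stackrel{(\zeta_3)}{\le} \delta_{2k}\sqrt{k}\cdot\mathrm{dist}(U^t, U^*),$$

where $(\zeta_1)$ follows from the fact that $\mathcal{A}$ satisfies $2k$-RIP and Lemma B.1, $(\zeta_2)$ follows from the fact that $(U^t)^\dagger\left(U^t(U^t)^\dagger - I\right) = 0$, and $(\zeta_3)$ follows from the following: $\|W\|_F = \|w\|_2 = 1$, $\|Z^\dagger Z\|_F \le \|Z\|_F^2 = 1$, and finally $\|\left(U^t(U^t)^\dagger - I\right)U^*\|_F \le \sqrt{k}\,\|\left(U^t(U^t)^\dagger - I\right)U^*\|_2$.

Since $w$ and $z$ were arbitrary unit vectors, we can conclude that $\|BD - C\|_2 \le \delta_{2k}\sqrt{k}\cdot\mathrm{dist}(U^t, U^*)$. Plugging this bound in (32) proves the lemma. □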
Proof of Lemma 4.7. Note that $\|\Sigma^*(R^{(t+1)})^{-1}\|_2 \le \frac{\sigma_1^*}{\sigma_{\min}(R^{(t+1)})}$. Now

$$\sigma_{\min}(R^{(t+1)}) = \min_{z, \|z\|_2 = 1} \|R^{(t+1)}z\|_2 = \min_{z, \|z\|_2 = 1} \|V^{t+1}R^{(t+1)}z\|_2$$
$$= \min_{z, \|z\|_2 = 1} \|V^*\Sigma^*(U^*)^\dagger U^t z - Fz\|_2$$
$$\ge \min_{z, \|z\|_2 = 1} \|V^*\Sigma^*(U^*)^\dagger U^t z\|_2 - \|Fz\|_2$$
$$\ge \min_{z, \|z\|_2 = 1} \|V^*\Sigma^*(U^*)^\dagger U^t z\|_2 - \|F\|_2$$
$$\ge \sigma_k^*\,\sigma_{\min}((U^*)^\dagger U^t) - \|F\|_2$$
$$\ge \sigma_k^*\sqrt{1 - \|(U^*_\perp)^\dagger U^t\|_2^2} - \sigma_1^*\,\|F(\Sigma^*)^{-1}\|_2$$
$$= \sigma_k^*\sqrt{1 - \mathrm{dist}(U^*, U^t)^2} - \sigma_1^*\,\|F(\Sigma^*)^{-1}\|_2. \tag{33}$$

The lemma now follows using the above inequality with Lemma 4.6. □

B.3 Noisy Matrix Sensing: Proofs 

We now consider an extension of the matrix sensing problem where measurements can be corrupted arbitrarily using a bounded noise. That is, we observe $b = \mathcal{A}(M + N)$, where $N$ is the noise matrix. For this noisy case as well, we show that AltMinSense recovers $M$ up to an additive approximation depending on the Frobenius norm of $N$.

Theorem B.3. Let $M$ and $\mathcal{A}(\cdot)$ be as defined in Theorem 2.2. Suppose the AltMinSense algorithm (Algorithm 1) is supplied inputs $\mathcal{A}$, $b = \mathcal{A}(M + N)$, where $N$ is the noise matrix s.t. $\|N\|_F \le \frac{1}{100}\sigma_k^*$. Then, after $T = 4\log(2/\epsilon)$ steps, the iterates $\hat U^T$, $\hat V^T$ of AltMinSense satisfy:

$$\mathrm{dist}(\hat V^T, V^*) \le \frac{10\|N\|_F}{\sigma_k^*} + \epsilon, \qquad \mathrm{dist}(\hat U^T, U^*) \le \frac{10\|N\|_F}{\sigma_k^*} + \epsilon.$$

See Definition 4.1 for the definition of $\mathrm{dist}(U, W)$.

Proof. At a high level, our proof for the noisy case closely follows the exact case proof given in Section 4. That is, we show that the update of the AltMinSense algorithm is similar to a power-method type update but with two error terms: one due to incomplete measurements and another due to the noise matrix.

Similar to our proof for the sensing problem (Section 4), we analyze QR-decomposition based updates. That is,

$$\hat U^t = U^t R_U^t \quad \text{(QR decomposition)},$$
$$\hat V^{t+1} = \arg\min_{\hat V} \|\mathcal{A}(U^t\hat V^\dagger) - b\|_2^2,$$
$$\hat V^{t+1} = V^{t+1}R_V^{t+1} \quad \text{(QR decomposition)}.$$
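Similar to Lemma 4.5, we can re-write the above update equation as:

$$\hat V^{t+1} = V^*\Sigma^*(U^*)^\dagger U^t - F + V^N\Sigma^N(U^N)^\dagger U^t - G, \tag{34}$$

where $F$ is the error matrix as defined in (8) and $G$ is the error matrix due to noise, given by:

$$G \stackrel{\mathrm{def}}{=} \left[\, (B^{-1}(BD^N - C^N)S^Nv^N)_1 \;\cdots\; (B^{-1}(BD^N - C^N)S^Nv^N)_k \,\right], \tag{35}$$

where $B$, $C$ and $D$ are defined in the previous section (see (7)) and $C^N$ and $D^N$ are defined below:

$$C^N \stackrel{\mathrm{def}}{=} \begin{bmatrix} C^N_{11} & \cdots & C^N_{1n} \\ \vdots & \ddots & \vdots \\ C^N_{k1} & \cdots & C^N_{kn} \end{bmatrix}, \qquad D^N \stackrel{\mathrm{def}}{=} \begin{bmatrix} D^N_{11} & \cdots & D^N_{1n} \\ \vdots & \ddots & \vdots \\ D^N_{k1} & \cdots & D^N_{kn} \end{bmatrix},$$

with $C^N_{pq} \stackrel{\mathrm{def}}{=} \sum_{i=1}^{d} A_i u_p^t (u_q^N)^\dagger A_i^\dagger$ and $D^N_{pq} \stackrel{\mathrm{def}}{=} \langle u_p^t, u_q^N\rangle\, I_{n \times n}$. Also,

$$S^N \stackrel{\mathrm{def}}{=} \begin{bmatrix} \sigma_1^N I_n & \cdots & 0_n \\ \vdots & \ddots & \vdots \\ 0_n & \cdots & \sigma_n^N I_n \end{bmatrix}, \qquad v^N \stackrel{\mathrm{def}}{=} \mathrm{vec}(V^N).$$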

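Now, multiplying (34) with $(V^*_\perp)^\dagger$, we get:

$$(V^*_\perp)^\dagger\hat V^{t+1} = (V^*_\perp)^\dagger\left(V^N\Sigma^N(U^N)^\dagger U^t - F - G\right).$$

That is,

$$\mathrm{dist}(V^*, V^{t+1}) = \|(V^*_\perp)^\dagger\hat V^{t+1}(R^{(t+1)})^{-1}\|_2$$
$$\le \left(\|V^N\Sigma^N(U^N)^\dagger U^t\|_2 + \|F\|_2 + \|G\|_2\right)\|(R^{(t+1)})^{-1}\|_2,$$
$$\le \left(\sigma_1^N + \|F(\Sigma^*)^{-1}\|_2\,\|\Sigma^*\|_2 + \|G\|_2\right)\|(R^{(t+1)})^{-1}\|_2,$$
$$\le \left(\sigma_1^N + \frac{\sigma_1^*\,\delta_{2k}\, k}{1 - \delta_{2k}}\,\mathrm{dist}(U^t, U^*) + \|G\|_2\right)\|(R^{(t+1)})^{-1}\|_2, \tag{36}$$

where the last inequality follows using Lemma 4.6.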

Now, we break down the proof into the following two steps:

• Bound $\|G\|_2$ (Lemma B.4, analogous to Lemma 4.6)

• Bound $\|(R^{(t+1)})^{-1}\|_2$ (Lemma B.5, similar to Lemma 4.7)

Later in this section, we provide the above mentioned lemmas and their detailed proofs.

Now, by assumption, cr^ < ||A^||p < o"^. Also, as S2k < 1/2, ^ 2. Finally, assume 

dist(y*, y*"*"^) > max(10 • ifl H^j^ ). Using these observations and lemmas B.4, B.5 along with 
(36), we get: 



dist(1/*,y*+^) < 



0.5dist ([/*,?/* 



-dist(C/*,C/*)2 - 0.5dist([/* , [/*) 



(37) 
As $\hat U^0$ is obtained using the SVD of $\sum_i A_i b_i$, using Lemma A.1 we have:

\\A{U^T,^V^ - C/*S*(y*)t)||2 < o.5P(Af)||^ + 462k\\A{U*^*{V*y)\\l 

IIC/OSV" - [/*S*(y*)t||2, < \\N\\l + A62k{l + S2km*fF, 
^ (alfWiU^Uy - I)U*\\l < ml + 4<52fc(l + 62k)kialf, 

dist(C/0,[/*) < \\{u\uy-i)u*fF <^^ + Qhkk (4)' < ^> 

where the last inequality follows using $\frac{\|N\|_F}{\sigma_k^*} < 1/100$.

The theorem now follows using the above equation with (37). □

Lemma B.4. Let the linear measurement $\mathcal{A}$ satisfy RIP for all $2k$-rank matrices and let $b = \mathcal{A}(M + N)$ with $M \in \mathbb{R}^{m \times n}$ being a rank-$k$ matrix and let $N = U^N\Sigma^N(V^N)^\dagger$. Let $\delta_{2k}$ be the RIP constant for rank-$2k$ matrices. Then, we have the following bound on $\|G\|_2$:



1 - 



\\G\\, < (38) 

Proof. Note that, 

IIGII2 < \\G\\f = \\B~\BD^ - C^)5^t;^||2 

< \\B-'\\2\\{BD'' - C^)5^||2||5V||2 < -^UBD'' - C^)S^||2, (39) 

1 - 02k 

where the last inequality follows using Lemma B.2 and the fact that = Vk- Now let 



I.]''' G R"'^ and z = [z\ Z2 ... ^n]^ G be any two arbitrary vectors such that 



1^112 ~ 11-^112 ~ ^- Then, 



k n d 

BD^ - C^) = E E E ^^4<^ {^'^^')^ - I-n) A\a^z, 

p=l q=l i=l 
d k n 

i=l p=l q=l 
d / k \ / " 

i=l \p=l I \q=l 



'li / a 

E E {AUW^) TV ^Ai {U\uy - Ir, 



"-q '^q ^q 



q=l \i=l 
Now, using RIP, we get:

n 



W 



9=1 

n 



'J2k 



<5]'^2fc||^^^||F||(C/*(f/*)^-Inxn)<||2|K^,||2, 
q=l 



<d2k E 

9=1 



I"? II2 Ir? ^9112 



9=1 



< S2k. 

This finishes the proof. 



< 52. IliVllF 

9=1 



□ 



Lemma B.5. Assuming conditions of Lemma B.4, we have the following bound on the minimum 
singular value of i?*-*^ ; 



12 ~ 11*^112 • 



armn (iJ^^+'^j > a^l - dist{m ,U*)^ - - ||F|| 

Proof. Similar to the proof of Lemma 4.7, we have the following set of inequalities: 

'^(*+i)^ 



0"r, 



I = min 




= min 


yW^(t+i)^. 




^ Ikll2=i 




2 Pll2 = l 




2 



mm 



> min 

Pll2 = l 



y*s*c/*'^?7*z + y^s^(c/^)tc/Wz -Fz-Gz 



V*T.*U*'^U^z 



> o"^ min 



1-^112 - 11^112 



1-^112 ~ Il<^ll2 



>cjl\l\-\U*^\j\l-a^ 



N 



\F\\ 



\\G\\, 



= a^Vl-dist(C/*,C/*)2-af - IIFII2 - IIGII2 . 

This proves the lemma. 

B.4 Stagewise Alternating Minimization for Matrix Sensing: Proofs 



□ 



Proof of Lemma 6.1. As the initial point of the i-ih. stage is obtained by one step of SVP [10], using 
Lemma A.l, we obtain: 



M-UlAV?..;)' ^< EK)' + 2'^2.||Af-^?,„,^,^_i||^ 
j=i+i 

Now, by assumption over the {i — l)-th stage error (this assumption follows from the inductive 
hypothesis in proof of Theorem 2.3), 



j=i+i 
The lemma now follows by setting $\delta_{2k} \le \frac{1}{3200k}$. □



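Proof of Lemma 6.2. For our proof, we consider two cases: a) $\frac{\sigma_i^*}{\sigma_{i+1}^*} < 5\sqrt{k}$, b) $\frac{\sigma_i^*}{\sigma_{i+1}^*} \ge 5\sqrt{k}$.

Case (a): In this case, using the monotonicity of the AltMin algorithm directly gives the error bound. That is, $\|M - \hat U^T_{1:i}(\hat V^T_{1:i})^\dagger\|_F^2 \le \|M - \hat U^0_{1:i}(\hat V^0_{1:i})^\dagger\|_F^2 \le \sum_{j=i+1}^{k}(\sigma_j^*)^2 + \frac{1}{100}(\sigma_i^*)^2 \le 16k(\sigma_{i+1}^*)^2$, where the last step uses $\sigma_i^* < 5\sqrt{k}\,\sigma_{i+1}^*$.

Case (b): At a high level, if $\frac{\sigma_i^*}{\sigma_{i+1}^*} \ge 5\sqrt{k}$ then $\hat U^t_{1:i}$ is "close" to $U^*_{1:i}$ and hence the error bound follows by using an analysis similar to the noisy case. Note that $\sigma_{i+1}^*$ being small implies that the "noise" is small. See Lemma B.6 for a formal proof of this case. □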



Lemma B.6. Assume conditions given in Theorem 2.3 are satisfied and let >b\/k. Also, let 

k 



j=i+i 



Then, Uf.^, V^^ satisfy: 



M-f/iWillF<max(e,16/c(a*+i) 



Proof. We first show that if ai and fij+i have large gap then V t, the t*^ iterate of the i-th stage, 
J7*.j is close to U^.^. Let U\_ be a basis of the subspace orthogonal to U^.^. 



fj, 



i+l 



i 2 



5Vk ' 



(40) 



We also have: 



^* ^t||2„ 



{uiY (M - uiAvhmi <\\M- uiM:;)'WF 

1 



< 



1 




i-hk 

1 



1 



nk 



o.n7o.^ti|2, 



*\2 



100^ 



1 - hk 



(41) 



where (Ci) follows from the fact that lines 5—8 of Algorithm 3 never increases A yM — U\.^{Vl 
Using (40), (41), and 4^ > 5\/fc, we obtain the following bound: 

II {u\_)'^ui,h<\^t. 



(42) 



Now, we consider the update equation for 
yt+l _ „„„™;„ II Alfrt 



^tM|2 



argmin \\A{Ui,iV - UliY:\,i{Vi,iy - t/*+i:feS*+i:fc(V;+i:fc; ;||2. 
V 
Note that the update is the same as in the noisy case with noise matrix $N = U^*_{i+1:k}\Sigma^*_{i+1:k}(V^*_{i+1:k})^\dagger$. Hence, from (34):

= v,*,KAul.^)^ul, -F + v*,,,k^+,,kiuUi:k)^uU - g, (43) 

where F and G are given by (8), (35). Multiplying (43) from the left by vl = / - y*+^(y*+^)^ we 
obtain: 



^vlvi,T.uuiJuU = vi{F- vt,,,^^,,^*^,JuU + g) 

v\vi{El,SJi;fuU ^ < \\Fh + v[v*^,,kT.*^^,j,{U*^,.Jui,, + ||G||p 

r r 

VIVI^T^I, < i , (||F||p + Vlv*+,.,,^^,,,{UU,.JUI, + ||G||p) , (44) 

where the last inequality follows using the fact that $\sigma_{\min}(A)\|B\|_F \le \|AB\|_F$. Using Lemma B.4 and a modification of Lemma 4.6, we get:



||i^|lF<'^2fc UiUl^,, , \\G\\p<62k 



< (52fc\/A;o-j+i. 



(45) 



Using (44), (45), and the fact that amm{UlUli) = Jl - ||^^lt^r:ill2> 



+ 



j=i+i 



Assuming 



> '^\/Y^j=i+i we obtain: 



Using similar analysis, we can show that, 
So after T > 8log{ka*) iterations, we have: 



2 

< - 
F - 3 



2 

< - 
F - 3 



(46) 



< 4 E K)'- 



j=i+l 



Using the above inequality, we now bound the error after T > 81og(/ccj*) iterations of the i-th stage: 



< 



(47)

For the first term, we have:



2 
F 



* \t 



(Ci) 
< 



+ 



+ 



+ 3 \\F\\l + 3 \\G\\l + 3 f/iVi:feS*+i^,(y;+i^,)t 



(C2) 

< (1 + 3* 



2k) 



+ 3(1+4) C/iVl:fc5]*+i.,(F;+i.,)t 



< 



(48) 



where $(\zeta_1)$ follows from (43) and $(\zeta_2)$ follows from (45). Using (47) and (48), we obtain the following bound:



i+i- 



Hence Proved. 



(49) 
□ 



C Matrix Completion: Proofs 

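Proof of Theorem 2.5. Using Theorem 5.1, after $O(\log(1/\epsilon))$ iterations, we get:

$$\mathrm{dist}(U^t, U^*) \le \epsilon, \quad \mathrm{dist}(V^{t+1}, V^*) \le \epsilon.$$

Now, using (19), the residual after the $t$-th step is given by:

$$M - U^t(V^{t+1})^\dagger = (I - U^t(U^t)^\dagger)M - U^t F^\dagger.$$

That is,

$$\|M - U^t(V^{t+1})^\dagger\|_F \le \|(I - U^t(U^t)^\dagger)M\|_F + \|F\|_F \le \sqrt{k}\,\|(I - U^t(U^t)^\dagger)U^*\Sigma^*\|_2 + \|F\|_F \le \sqrt{k}\,\sigma^*_1\,\mathrm{dist}(U^t, U^*) + \|F\|_F.$$

Now, using the fact that $\mathrm{dist}(U^t, U^*) \le \epsilon$ and the above equation, we get:

$$\|M - U^t(V^{t+1})^\dagger\|_F \le \sqrt{k}\,\sigma^*_1\epsilon + \|F\|_F \stackrel{(\zeta_1)}{\le} \sqrt{k}\,\sigma^*_1\epsilon + \sigma^*_1\sqrt{k}\,\epsilon \le 2\sigma^*_1\sqrt{k}\,\epsilon,$$

where $(\zeta_1)$ follows by Lemma 5.6 and setting $\delta_{2k}$ appropriately. Theorem 2.5 now follows by setting $\epsilon' = 2\sqrt{k}\,\|M\|_F\,\epsilon$. □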
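The projection step in the chain above can be checked numerically in the rank-1 case, where $\mathrm{dist}(u, u^*) = \sqrt{1 - \langle u, u^*\rangle^2}$ for unit vectors and the bound $\|(I - uu^\dagger)M\|_F \le \sqrt{k}\,\sigma^*_1\,\mathrm{dist}(u, u^*)$ holds with equality. The sketch below is only an illustration under these assumptions: the dimensions, the perturbation size and the helper names `dist_rank1` and `projected_residual_norm` are hypothetical choices, not part of the paper.

```python
import math
import random


def normalize(x):
    n = math.sqrt(sum(a * a for a in x))
    return [a / n for a in x]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def dist_rank1(u, u_star):
    """Subspace distance between two unit vectors: sqrt(1 - <u, u*>^2)."""
    c = dot(u, u_star)
    return math.sqrt(max(0.0, 1.0 - c * c))


def projected_residual_norm(u, M):
    """Frobenius norm of (I - u u^T) M, computed column by column."""
    m, n = len(M), len(M[0])
    total = 0.0
    for j in range(n):
        col = [M[i][j] for i in range(m)]
        c = dot(u, col)
        total += sum((col[i] - c * u[i]) ** 2 for i in range(m))
    return math.sqrt(total)


rng = random.Random(0)
u_star = normalize([rng.gauss(0, 1) for _ in range(8)])
v_star = normalize([rng.gauss(0, 1) for _ in range(6)])
sigma = 3.0
# Rank-1 target M = sigma * u_star * v_star^T.
M = [[sigma * a * b for b in v_star] for a in u_star]
# Current iterate: a perturbed, re-normalized copy of u_star.
u = normalize([a + 0.1 * rng.gauss(0, 1) for a in u_star])

lhs = projected_residual_norm(u, M)
rhs = sigma * dist_rank1(u, u_star)
print(lhs, rhs)
```

For rank $k > 1$ the same computation gives an inequality rather than an equality, with the extra $\sqrt{k}$ factor coming from passing between Frobenius and spectral norms.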
C.1 Initialization: Proofs

Proof of Lemma 5.2. From Lemma C.1, we see that $U^{(0)}$ obtained after step 3 of Algorithm 2 satisfies $\mathrm{dist}(U^{(0)}, U^*) \le \frac{1}{64k}$. The lemma now follows by using this observation with Lemma C.2. □

We now provide the two results used in the above lemma. 

Lemma C.1. After step 3 in Algorithm 2, whp we have:

$$\mathrm{dist}(U^{(0)}, U^*) \le \frac{1}{64k}.$$



\M\ 



Proof. From Theorem 3.1 in [12], we have the following result: 

/ k " 
\\M -Mk\L < C —= 
\p^Jmn J 

Let U^'^^TiV^ be the top k singular components of M^. We also have: 

|2 



M-Mj 



k\\2 



(Ci) 

> 



U 



(0) 



U*T,* 



U 



(0) 



U* 



(0) 



where (Ci) follows from the fact that the column space of the first two terms in the equation is U 
where as the column space of the last two terms is U^^\ Using the above two inequalities, we get: 



U 



(0) 



U* 



<c4 



< 



/mp 



W^k' 



ifp> 



C'k* logn (cr*)^ 



for a large enough constant C . 



□ 



Lemma C.2. (Analysis of step 4 of Algorithm 2) Suppose $U^*$ is incoherent with parameter $\mu$ and $\hat{U}$ is an orthonormal column matrix such that $\mathrm{dist}(\hat{U}, U^*) \le \frac{1}{64k}$. Let $U^c$ be obtained from $\hat{U}$ by setting all entries greater than the clipping threshold of step 4 of Algorithm 2 to zero. Let $\tilde{U}$ be an orthonormal basis of $U^c$. Then,

• $\mathrm{dist}(\tilde{U}, U^*) \le 1/2$ and

• $\tilde{U}$ is incoherent with parameter $4\mu\sqrt{k}$.
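The clip-then-orthonormalize step analysed in this lemma can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: the threshold `0.6` is an arbitrary number chosen for the toy data (the actual threshold depends on the incoherence parameter and the dimensions), Gram-Schmidt stands in for whichever orthonormalization is used, and all function names are hypothetical.

```python
import math


def clip_entries(U, threshold):
    """Zero out every entry of U whose magnitude exceeds threshold."""
    return [[x if abs(x) <= threshold else 0.0 for x in row] for row in U]


def orthonormalize(U):
    """Gram-Schmidt on the columns of U (given as a list of rows).

    Assumes U has full column rank.
    """
    m, k = len(U), len(U[0])
    cols = []
    for j in range(k):
        c = [U[i][j] for i in range(m)]
        for q in cols:
            proj = sum(a * b for a, b in zip(q, c))
            c = [a - proj * b for a, b in zip(c, q)]
        norm = math.sqrt(sum(a * a for a in c))
        cols.append([a / norm for a in c])
    return [[cols[j][i] for j in range(k)] for i in range(m)]


def max_row_norm(U):
    """Largest Euclidean norm of a row; small values mean an incoherent basis."""
    return max(math.sqrt(sum(x * x for x in row)) for row in U)


# Toy 5 x 2 input whose first row carries most of the mass of column 1.
raw = [[3.0, 0.2], [0.5, 1.0], [0.5, -1.0], [0.5, 1.0], [0.5, -1.0]]
U_hat = orthonormalize(raw)       # orthonormal, but row 0 is "spiky"
U_c = clip_entries(U_hat, 0.6)    # 0.6 is an illustrative threshold only
U_tilde = orthonormalize(U_c)

print(max_row_norm(U_hat), max_row_norm(U_tilde))
```

On this toy input the clipping removes the single large entry, and the re-orthonormalized basis has a visibly smaller maximum row norm, which is the quantity that incoherence controls.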

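Proof. Since $\mathrm{dist}(\hat{U}, U^*) \le d$, we have that for every $i$, $\exists\, u_i \in \mathrm{Span}(U^*)$, $\|u_i\|_2 = 1$ such that $\langle \hat{u}_i, u_i \rangle \ge \sqrt{1 - d^2}$, where $\hat{u}_i$ denotes the $i$-th column of the given orthonormal matrix. Also, since $u_i \in \mathrm{Span}(U^*)$, we have that $u_i$ is incoherent with parameter $\mu\sqrt{k}$:

$$\|u_i\|_2 = 1 \quad \text{and} \quad \|u_i\|_\infty \le \frac{\mu\sqrt{k}}{\sqrt{m}}.$$

Let $u^c_i$ be the vector obtained by setting all the elements of $\hat{u}_i$ with magnitude greater than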



def 



to zero and let = Ui — u?. Now, note that if for element j of Ui we have 



> 



then, 



-^V - ul I 







j " j 
u\ - u\ 




— ~ 





. Hence, 



(Ci) 

1^2 ^2 II 2 — 11'^'^ '^'i|l2 



I'"j|l2 + Il^ill2 ~ 2(tij,ni) ) ^ < \/2(i, 



This also implies the following: 



Ihilla < IKII2 < \/2d{V2-d) < 2Vd, for d < 

Let C/^ = UA^^ (QR decomposition). Then, for any u*^ G Span(J7J^) we have: 



t rrc 



|A|l2< 



< llf/is) IIAII2 < (^+ llf^ip) 11^112 ^ (d + 2VAJd) IIAII2 < 3Vkd 



We now bound 

|2 



as follows: 



|A| 



< 



< 



|[/c||^ l-Akd 



<4/3, 



where we used the fact that d < So we have: 

{uD^ U ^ < sVkd ■ 4/3 = 4Vkd. 

This proves the first part of the lemma. 

Incoherence of U follows using the following set of inequalities: 



max ||ej^C/||2 < — ^ max ||ej^C/'^A|| < — ^ max ||ej[/^||2 



< AfiVk. 



□ 



C.2 Rank-1 Matrix Completion: Proofs 

Proof of Lemma 5.3. Using the definition of spectral norm,

$$\|B^{-1}(\langle u^*, u^t\rangle B - C)v^*\|_2 \le \|B^{-1}\|_2\,\|(\langle u^*, u^t\rangle B - C)v^*\|_2.$$

As $B$ is a diagonal matrix, $\|B^{-1}\|_2 = \frac{1}{\min_i B_{ii}} \le \frac{1}{1-\delta_2}$, where the last inequality follows using Lemma C.3. The lemma now follows using the above observation and Lemma C.4. □

Lemma C.3. Let M = a*u*{v*)^ , p, ft, be as defined in Lemma 5.3. Then, w.p. at least 1 — ^, 



P 



< <5o 
Proof. Since the first part of the lemma is a direct consequence of the second part, we will prove only the second part. Let $\delta_{ij}$ be a Bernoulli random variable that indicates membership of index $(i,j) \in \Omega$. That is, $\delta_{ij} = 1$ w.p. $p$ and $0$ otherwise. Define $Z_j = \frac{1}{p}\sum_i \delta_{ij}\, u^t_i u^*_i$. Note that

E[Zj] = {u\u*). Furthermore, E[Z|] = (^1 - ^.(n*n*)2 < and m&i^i\u\u*\ < ^. Using 
Bernstein's inequality, we get: 



Fv{\Zj - {u\u*)\ > 62) < exp 



^4 + /W?<^2/3 



(50) 



Using union bound (for all j) and for p > ^'^^gi ^ , w.p. 1 — Vj, {u^,u*) — 82 < Zj < (n*, u*) + 

62. m, n ^ 
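The concentration of $Z_j$ around $\langle u^t, u^*\rangle$ used in this proof can be observed with a short Monte Carlo experiment. The sketch below assumes the simplest incoherent case, $u^t = u^* = (1/\sqrt{m}, \dots, 1/\sqrt{m})$, and arbitrary illustrative values of $m$, $p$ and the number of trials; the function names are hypothetical.

```python
import math
import random


def sample_Z(u, u_star, p, rng):
    """One draw of Z = (1/p) * sum_i delta_i * u_i * u_star_i, delta_i ~ Bernoulli(p)."""
    return sum(a * b for a, b in zip(u, u_star) if rng.random() < p) / p


def empirical_mean_and_max_dev(u, u_star, p, trials, seed=0):
    """Return (<u, u_star>, empirical mean of Z, largest |Z - <u, u_star>|)."""
    rng = random.Random(seed)
    target = sum(a * b for a, b in zip(u, u_star))
    draws = [sample_Z(u, u_star, p, rng) for _ in range(trials)]
    mean = sum(draws) / trials
    max_dev = max(abs(z - target) for z in draws)
    return target, mean, max_dev


m = 400
u = [1.0 / math.sqrt(m)] * m   # flat unit vector: every entry has magnitude 1/sqrt(m)
u_star = list(u)
target, mean, max_dev = empirical_mean_and_max_dev(u, u_star, p=0.5, trials=200)
print(target, mean, max_dev)
```

Rescaling by $1/p$ makes $Z_j$ an unbiased estimate of the inner product, and because every summand is small for incoherent vectors, individual draws stay close to the target, which is what Bernstein's inequality quantifies.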



Lemma C.4. Let M = a*u* {v*)\ p, Q, be as defined in Lemma 5.3. Then, w.p. at least 1 — \, 



\\{{u\u')B - C)v*\\2 < 52^1-(n*,n*)2. 
Proof. Let x € M" be a unit vector. Then, Vx: 



{{u*,u')B - C)v* = ^Y1 ^jV*A{^*,^')i<) 



2 t *\ 



<^ ic7V^^^x2(^*)y^((n*,n*)(n*) 
p n 



(51) 



where $C > 0$ is a global constant, $(\zeta_1)$ follows by using a modified version of Lemma 6.1 of [12] (see Lemma C.5), and $(\zeta_2)$ follows by using incoherence of $v^*$ and $u^*$. The lemma now follows by
observing that max^_||^||2=ixt((u*,n*)S-C7)t;* = \\{{u* , u^) B - C)v* ^ and p > ^^^^y . □ 

Proof of Lemma 5.4- Using (15) and using the fact that B,C are diagonal matrices: 

= a*{n\n*)v* - ^ {{u\u*)B,, - C,,) v*. 



3J 



We bound the largest magnitude of elements in u*"*"^ as follows. For every j G [n], we have: 



< cr {u ,u }Vj \ + 



{{u\u*)B,j-C,,)v* 



<^ a*{u\u*)J^ + {{u\u*) (1 + ^2) + {{u\u*)+62)) ^ 

^/n I - 02 ^Jn 

3cr*(l+<52)At 



^ 1-32 ^ /^l 



where $(\zeta_1)$ follows from the fact that $1 - \delta_2 \le B_{jj} \le 1 + \delta_2$ and $|C_{jj}| \le |\langle u^t, u^*\rangle| + \delta_2$ (please refer to Lemma C.3).
Also, from (17) we see that:



(Ci) a* 
- ~2 



where (Ci) follows from the fact that dist {u^,u*^ < (please refer Lemma 5.2). Using the above 
two inequalities, we obtain: 



I Hoc ||^+i|i - (Zl) ^/E' 



This finishes the proof. □ 

Lemma C.5 (Modified version of Lemma 6.1 of [12]). Let Q be a set of indices sampled uniformly 
at random from [m] x [n] with each element of [m] x [n] sampled independently with probability 
P > Then, w.p. at least I - ^, e R"',y e W s.t. Y.i^t = 0, we have: Y^ijan^iVj < 

Cy^-^/mnp||x||2||y||2 5 where C > is a global constant. 

C.3 General Rank-k Matrix Completion: Proofs

Proof of Lemma 5.5. From the decoupled update equation (20), we obtain:

$$(V^{t+1})^{(j)} = (R^{(t+1)})^{-1}\left(D - (B^j)^{-1}(B^j D - C^j)\right)\Sigma^*(V^*)^{(j)}, \quad 1 \le j \le n.$$

We bound the two-norm of $(V^{t+1})^{(j)}$ as follows:



aJKn^/|l^,ll \\Bwq^ + \\cqA 

(g) + ^2k) + (1 + hk) 



al^l - dist^ ([/(*),[/*) - -tS2,kdist{m^),U*) V 1-S2k 

~ al^l - dist^ (^(0),^*) - -i'^>'''f_^(^^'''^U') - 
where we used the following inequalities in {(i): 



\iyyh<^^ (52) 



fimin [R^'^'^) > al^Jl - dist^ (C/W, U*) - al62kkdist{U^'\U*), (53) 

$\sigma_{\min}(B^j) \ge 1 - \delta_{2k}$ and $\sigma_{\max}(B^j) \le 1 + \delta_{2k}$, (54)

$\sigma_{\max}(C^j) \le 1 + \delta_{2k}$, and (55)

$\sigma_{\max}(D) \le 1$, (56)

where (52) follows from the incoherence of $V^*$, (53) follows from an analysis similar to the proof of Lemma 4.7, (54) follows from (the proof of) Lemma C.6, (55) follows from Lemma C.7, and finally (56) follows from the fact that $D = (U^t)^\dagger U^*$ with $U^t$ and $U^*$ being orthonormal column matrices. □

Proof of Lemma 5.6. Note that,

$$\|F(\Sigma^*)^{-1}\|_2 \le \|F(\Sigma^*)^{-1}\|_F = \|B^{-1}(BD - C)v^*\|_2 \le \|B^{-1}\|_2\,\|(BD - C)v^*\|_2 \le \frac{\delta_{2k}}{1-\delta_{2k}}\,\mathrm{dist}(U^t, U^*), \quad (57)$$

where the last inequality follows using Lemma C.6 and Lemma C.8. □

We now bound $\|B^{-1}\|_2$ and $\|C^j\|_2$, which is required by our bound for $F$ as well as for our incoherence proof.



Lemma C.6. Let AL,^},p, and [/* be as defined in Theorem 2.5 and Lemma 5.6. Then, w.p. at 
least 1 — 

Il^"'ll2<-^. (58) 

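Proof of Lemma C.6. We have:

$$\|B^{-1}\|_2 = \frac{1}{\sigma_{\min}(B)} = \frac{1}{\min_{x,\|x\|_2=1} x^\dagger B x},$$

where $x \in \mathbb{R}^{nk}$. Let $x = \mathrm{vec}(X)$, where $x_p$ is the $p$-th column of $X$ and $x^j$ is the $j$-th row of $X$. Now, $\forall x$,

$$x^\dagger B x = \sum_j (x^j)^\dagger B^j (x^j) \ge \min_j \sigma_{\min}(B^j).$$

The lemma would follow using the bound on $\sigma_{\min}(B^j)$, $\forall j$, that we show below.

Lower bound on $\sigma_{\min}(B^j)$: Consider any $w$ such that $\|w\|_2 = 1$. We have:

$$Z = w^\dagger B^j w = \frac{1}{p}\sum_{i:(i,j)\in\Omega} \langle w, (U^t)^{(i)}\rangle^2 = \frac{1}{p}\sum_i \delta_{ij}\,\langle w, (U^t)^{(i)}\rangle^2.$$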

Note that, E[Z] = w^UU^w = w^w = 1 and E[Z''] = i ([/*)»)^ < ^ E^(^, (U'f^)^ = ^, 

where the second last inequality follows using incoherence of J7*. Similarly, maxj \{w, 
Hence, using Bernstein's inequality: 

6^^/2 mp 



Pr(|Z-E[Z]| >52fc) <exp(- 



1 + ^2fc/3/ifA;' 



That is, by using p as in the statement of the lemma with the above equation and using union 
bound, we get (w.p. > 1 — 1/n^): Vw,j w^B^w > 1 — 62k- That is, Vj, (7mm(^-') > (1 — S2k)- D 
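The lower bound on $\sigma_{\min}(B^j)$ can be illustrated for $k = 2$, where the smallest eigenvalue of a symmetric $2 \times 2$ matrix has a closed form. The sketch below builds a random orthonormal $m \times 2$ matrix, keeps each row independently with probability $p$, and forms $\frac{1}{p}\sum_i U^{(i)}(U^{(i)})^\dagger$ over the kept rows. The values of $m$ and $p$, the Gaussian construction of $U$, and all function names are assumptions made only for this illustration.

```python
import math
import random


def gram_schmidt_2(a, b):
    """Orthonormalize two vectors; returns the rows (q1[i], q2[i]) of an m x 2 matrix."""
    na = math.sqrt(sum(x * x for x in a))
    q1 = [x / na for x in a]
    proj = sum(x * y for x, y in zip(q1, b))
    r = [y - proj * x for x, y in zip(q1, b)]
    nr = math.sqrt(sum(x * x for x in r))
    q2 = [x / nr for x in r]
    return list(zip(q1, q2))


def sampled_gram(U, p, rng):
    """Entries (b11, b12, b22) of (1/p) * sum of U[i] U[i]^T over rows kept w.p. p."""
    b11 = b12 = b22 = 0.0
    for x, y in U:
        if rng.random() < p:
            b11 += x * x
            b12 += x * y
            b22 += y * y
    return b11 / p, b12 / p, b22 / p


def min_eig_sym2(b11, b12, b22):
    """Smallest eigenvalue of the symmetric matrix [[b11, b12], [b12, b22]]."""
    mean = 0.5 * (b11 + b22)
    rad = math.sqrt((0.5 * (b11 - b22)) ** 2 + b12 * b12)
    return mean - rad


rng = random.Random(0)
m = 2000
U = gram_schmidt_2([rng.gauss(0, 1) for _ in range(m)],
                   [rng.gauss(0, 1) for _ in range(m)])

lam_full = min_eig_sym2(*sampled_gram(U, 1.0, rng))   # every row kept: B equals I
lam_sampled = min_eig_sym2(*sampled_gram(U, 0.3, rng))
print(lam_full, lam_sampled)
```

With every row kept the matrix is the identity, and with subsampling its smallest eigenvalue stays near 1 for incoherent $U$, mirroring the statement $\sigma_{\min}(B^j) \ge 1 - \delta_{2k}$.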
Lemma C.7. Let $M$, $\Omega$, $p$, and $U^t$ be as defined in Theorem 2.5 and Lemma 5.6. Also, let $C^j \in$
M^^^' he defined as: = lY.i:(i,j)&Q.{U^)^^ {U*)^^^ Then, w.p. at least 1 - ^: 

llC^'lb < l + <52fe,Vj (59) 
Proof of Lemma C.7. Let $x \in \mathbb{R}^k$ and $y \in \mathbb{R}^k$ be two arbitrary unit vectors. Then,



P 



That is, Z = x^0y = lY.i^ij{xHU^)^'^){y\U*f^). Note that, E[Z] = xt(C/*)t[/*y, E[Z2] = 

-pT^ii^KuWiyKU*)^^? < ^xt(C/*)tC/*x = g and maxi|(xt(C/*)W)(yt ([/*)»)! < l^. Lemma 
now follows using Bernstein's inequality and using bound for p given in the lemma statement. □ 

Finally, we provide a lemma to bound the second part of the error term (F). 

Lemma C.8. Let M,Q,p, and [/* be as defined in Theorem 2.5 and Lemma 5.6. Then, w.p. at 
least 1 — -"^ • 



7) *^ ' 



\\{BD-C)v*\\2 < <52fcdist(y*+\F*^ 



(60) 



where v* = vec{V*), i.e. v* 



V* 



V,: 



Proof of Lemma C.8. Let $X \in \mathbb{R}^{n \times k}$ and let $x = \mathrm{vec}(X) \in \mathbb{R}^{nk}$ s.t. $\|x\|_2 = 1$. Also, let $x_p$ be the $p$-th column of $X$ and $x^j$ be the $j$-th row of $X$.

Let u' = ([/*)(*) and u<'^ = ([/*)(*). Also, let Hi = {BW - C^), i.e.. 



p 



where iJ/ G M'=^'=. Note that, 



(61) 



Now, x^{BD - C)v* = J:,{x')HBW - C^)(y*)0) = 1 ^p, Also, using (61), 

y{p,q): 



Hence, applying Lemma C.5, we get w.p. at least 1 — ij: 



^{BD - C)v* = Y{x^)\BW - a){V*)^^) <W lY.^4?{V*^)\ Y.^Hl)lr (62) 

i PQ y j V « 
Also,

i i i 

= max(<)2(l - \\U'm\\l) < ^dist{U\U*f. (63) 
Using (62), (63) and incoherence of V* ^ we get (w.p. 1 — 1/n^), Vx: 

x\BD - C)v* < V ^dist(C/*, U*)\\x„\\2 < 52fcdist(C/*, [/*), 
^-^^ mp 

pq 

where we used the fact that $\sum_p \|x_p\|_2 \le \sqrt{k}\,\|x\|_2 = \sqrt{k}$ in the last step. The lemma now follows by observing that $\max_{x,\|x\|_2=1} x^\dagger(BD - C)v^* = \|(BD - C)v^*\|_2$. □